Research Report
Crossmodal interactions and multisensory integration in the
perception of audio-visual motion —A free-field study
Kristina Schmiedchen ⁎, Claudia Freigang, Ines Nitsche, Rudolf Rübsamen
Faculty of Biosciences, Pharmacy and Psychology, University of Leipzig, Talstrasse 33, 04103 Leipzig, Germany
ARTICLE INFO

Article history:
Accepted 6 May 2012
Available online 14 May 2012

ABSTRACT
Motion perception can be altered by information received through multiple senses. So far,
the interplay between the visual and the auditory modality in peripheral motion perception
is scarcely described. The present free-field study investigated audio-visual motion
perception for different azimuthal trajectories in space. To disentangle effects related to
crossmodal interactions (the influence of one modality on signal processing in another
modality) and multisensory integration (binding of bimodal streams), we manipulated the
subjects’ attention in two experiments on a single set of moving audio-visual stimuli.
Acoustic and visual signals were either congruent or spatially and temporally disparate at
motion offset. (i) Crossmodal interactions were studied in a selective attention task.
Subjects were instructed to attend to either the acoustic or the visual stream and to indicate
the perceived final position of motion. (ii) Multisensory integration was studied in a divided
attention task in which subjects were asked to report whether they perceived unified or
separated audio-visual motion offsets. The results indicate that crossmodal interactions in
motion perception do not depend on the integration of the audio-visual stream.
Furthermore, in the crossmodal task, both visual and auditory motion perception were
susceptible to modulation by irrelevant streams, provided that temporal disparities did not
exceed a critical range. Concurrent visual streams modulated auditory motion perception in
the central field, whereas concurrent acoustic streams attracted visual motion information
in the periphery. Differential abilities between the visual and auditory system when
attempting to accurately track positional information along different trajectories account
for the observed biasing effects.
© 2012 Elsevier B.V. All rights reserved.
Keywords:
Audio-visual
Motion perception
Dynamic capture
Perceived unity
Trajectory
Attention
1. Introduction
Objects that surround us usually stimulate more than a single
sense. Various streams of information that belong to the same
object are integrated in the brain, while simultaneously those
streams that belong to different objects are separated. Over
the past decades, a wide range of studies dealing with the
perception of stationary events has led to a substantial
improvement of our understanding of how multisensory
inputs are combined into meaningful percepts (Calvert et al.,
2001; Ernst and Bülthoff, 2004; McGurk and MacDonald, 1976;
Senkowski et al., 2007; Teder-Sälejärvi et al., 2005; Werner and Noppeney, 2011).

BRAIN RESEARCH 1466 (2012) 99–111. doi:10.1016/j.brainres.2012.05.015
⁎ Corresponding author. Fax: +49 341 9736848. E-mail address: k.schmiedchen@uni-leipzig.de (K. Schmiedchen).

However, the specifics of the interplay, i.e.
the mutual influence between different senses in the percep-
tion of motion, are described in much less detail.
Motion perception requires the dynamic integration of spatial
changes over time. In vision, a moving light pattern consecu-
tively activates adjacent locations on the retina, which in turn
have to be associated by motion detectors in the visual cortex
(Albright and Stoner, 1995). The auditory system, in contrast, has
to track dynamically changing interaural time differences (ITDs)
and/or interaural intensity differences (IIDs) in order to infer
acoustic motion (Middlebrooks and Green, 1991). As such,
dynamic location coding fundamentally differs between the two modalities: a direct, retinotopically based representation in the visual system versus an indirect, reconstructed representation in the auditory system (Wilson and O'Neill, 1998).
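The ITD cue mentioned above can be illustrated with a spherical-head approximation. The sketch below uses Woodworth's formula, ITD = (r/c)(θ + sin θ), which is a standard textbook approximation and not taken from the present paper; the head radius and speed of sound are assumed values.

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound_m_s=343.0):
    """Approximate interaural time difference (ITD, in seconds) for a
    source at the given azimuth, using Woodworth's spherical-head
    formula: ITD = (r / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))

# A laterally moving sound sweeps through a range of ITDs that the
# auditory system must track dynamically to infer motion.
for azimuth in (0, 8, 38, 60):
    print(f"{azimuth:3d} deg -> {woodworth_itd(azimuth) * 1e6:6.1f} us")
```

The monotonic growth of ITD with azimuth is what makes a dynamically changing ITD a usable motion cue.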
A wide range of studies making use of various methodologies
such as single-cell recordings in animals, human behavior, EEG
and fMRI recordings, or statistical modeling, all seem to indicate
that the combination of multisensory inputs requires the
analysis of spatial and/or temporal features between different
streams of sensory information. Asynchronous inputs are linked
as long as they fall within a defined temporal window (Colonius
and Diederich, 2010; Lewald and Guski, 2003; Meredith et al.,
1987; Navarra et al., 2005; Slutsky and Recanzone, 2001).
Similarly, spatial disparities between acoustic and visual inputs
are tolerated to a certain degree in static events (Bertelson and
Radeau, 1981; Hairston et al., 2003; Thurlow and Jack, 1973; Welch
and Warren, 1980) and moving events (Alink et al., 2008; Meyer
and Wuerger, 2001; Soto-Faraco et al., 2002; Stekelenburg and
Vroomen, 2009). Other studies have highlighted the crucial role
of the co-analysis of spatial and temporal features in multisen-
sory perception (Bolognini et al., 2005; Lewald and Guski, 2003;
Meyer et al., 2005; Recanzone, 2009; Soto-Faraco et al., 2004; Stein
and Meredith, 1993).
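The windowed binding described above can be put into a toy rule; the window sizes below are illustrative assumptions, not values from the cited studies.

```python
def bind_streams(temporal_disparity_ms, spatial_disparity_deg,
                 temporal_window_ms=200.0, spatial_window_deg=15.0):
    """Toy rule: two sensory streams are bound into one event only if
    both the temporal and the spatial disparity fall inside their
    respective tolerance windows (illustrative window sizes)."""
    within_time = abs(temporal_disparity_ms) <= temporal_window_ms
    within_space = abs(spatial_disparity_deg) <= spatial_window_deg
    return within_time and within_space

print(bind_streams(65, 5))    # small disparities: bound
print(bind_streams(780, 15))  # large temporal disparity: segregated
```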
Importantly, multisensory integration refers to the binding
of stimuli perceived through multiple senses, whereas crossmodal interactions describe the direct influence of one modality
on signal processing in another modality without necessarily
integrating information (Spence et al., 2009). In particular, the
term crossmodal localization bias is used to describe the
displacement of the perceived location of a signal in the
attended modality towards the location of a concurrent, but
ignored signal in another modality (Bertelson et al., 2000;
Hairston et al., 2003).
A well described crossmodal interaction phenomenon is the
ventriloquist illusion. Ventriloquism refers to a bias of per-
ceived information towards the information of a ‘competing’
sense that has either a more precise spatial or temporal
resolution. The first studies on ventriloquism in motion
perception supported the notion of a visual dominance (Allen
and Kolers, 1981; Mateeff et al., 1985), as the visual system was
found to provide more salient cues in perceiving motion than
the auditory or the somatosensory system. Crossmodal dy-
namic capture of auditory motion was also demonstrated by
more recent studies (Kitajima and Yamashita, 1999; Sanabria et
al., 2007b; Soto-Faraco et al., 2002; Strybel and Vatakis, 2004),
where an alteration of the perceived direction of auditory
motion towards that of the visual signal was attributed to a
‘capture’ by the visual system which has superior spatial
resolution in the central field. In contrast, laterally moving
sounds were shown to induce motion perception of static visual
flashes (Hidaka et al., 2011), which demonstrates that the
auditory system can bias the visual system in the spatial
domain in instances where the visual resolution is inferior to
the auditory resolution. This finding is consistent with previous
studies employing stationary events (Alais and Burr, 2004;
Hairston et al., 2003). The unrestricted dominance of the visual
modality in motion perception was further challenged by
additional reports of modulatory effects of a moving sound on
the perceived direction of visual motion (Alink et al., 2012;
Brooks et al., 2007; Jain et al., 2008) or in resolving ambiguous
visual motion displays (Sanabria et al., 2007a). Also, the timing
of a stationary sound has been shown to affect the detection or
the perceived direction of moving visual signals (Freeman and
Driver, 2008; Getzmann, 2007; Kafaligonul and Stoner, 2010).
Taken together, the results of these studies suggest context-
dependent crossmodal interactions between the visual and the
auditory modality in motion perception and support the
hypothesis that perception is dominated by the modality
which provides the most reliable information (Battaglia et al.,
2003; Burr and Alais, 2006; Ernst and Banks, 2002).
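The reliability-based account cited above can be sketched as inverse-variance (maximum-likelihood) cue combination in the spirit of Ernst and Banks (2002); the numbers below are purely illustrative.

```python
def mle_combined_estimate(auditory_deg, auditory_var, visual_deg, visual_var):
    """Reliability-weighted combination of an auditory and a visual
    position estimate: each cue is weighted by its inverse variance,
    and the combined variance is lower than either single-cue variance."""
    w_auditory = (1.0 / auditory_var) / (1.0 / auditory_var + 1.0 / visual_var)
    combined = w_auditory * auditory_deg + (1.0 - w_auditory) * visual_deg
    combined_var = 1.0 / (1.0 / auditory_var + 1.0 / visual_var)
    return combined, combined_var

# Centrally, vision is precise (low variance) and dominates the percept;
# in the periphery its variance grows and audition pulls the estimate.
central = mle_combined_estimate(auditory_deg=10.0, auditory_var=4.0,
                                visual_deg=8.0, visual_var=1.0)
print(central)  # combined estimate lies close to the visual position
```

The same formula predicts the reversal reported in the periphery: once visual variance exceeds auditory variance, the weights flip and the sound dominates.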
Unlike crossmodal interactions, multisensory integration
necessarily implicates the binding of various streams of
information and yields a coherent percept. Ernst and
Bülthoff (2004) pointed out that the integration of multisen-
sory inputs reduces the variance between the individual
sensory estimates and leads to enhanced reliability of the
combined percept. Integration effects have been observed at
very distinct processing stages, from low-level structures
such as the superior colliculus (Meredith and Stein, 1983;
Meredith et al., 1987) to higher-order cortical areas such as
the superior temporal sulcus (Calvert et al., 2001; Werner and
Noppeney, 2010). Consequently, Meyer et al. (2011) suggested
that these different stages are responsible for the integration
of various contents of bimodal stimuli. That is, sensory
integration (e.g. spatial and temporal features) is likely
reflected in low-level structures, whereas semantic integra-
tion (e.g. linguistic contents) is assigned to higher-order
structures. Psychophysical studies investigating low-level
integration of motion features found improved speed estima-
tion (Wuerger et al., 2003) and an increase in the detection
rate (Meyer et al., 2005) in the bimodal condition. An fMRI
study by Lewis and Noppeney (2010) on higher-level integra-
tion revealed that audio-visual synchrony facilitates motion
discrimination. Furthermore, both behavioral and neurophys-
iological studies have shown that the integration of two or
more streams of information into a unified percept is optimal
for spatially and temporally aligned features (Stein and
Meredith, 1993; Wallace et al., 2004). In conflicting audio-
visual situations, the integration or separation of several
streams depends on the magnitude of the spatial and/or
temporal disparities, as previously suggested by the findings
of behavioral studies employing stationary events (Bertelson
and Radeau, 1981; Lewald and Guski, 2003; Slutsky and
Recanzone, 2001; Wallace et al., 2004).
A limitation of most of the previous studies is that moving
stimuli were mainly presented in the central visual field. To date,
interactions between the visual and the auditory system in
perceiving motion in the periphery—where the interplay between
the two senses might differ—have been seldom investigated.
The present free-field study aimed at better understanding
how crossmodal interactions and multisensory integration
contribute to audio-visual motion perception in central and
peripheral space. To this end, a single set of audio-visual
stimuli was employed in two different experimental tasks. The
subjects’ attention was either (i) selectively directed to a single
modality or (ii) simultaneously directed to both modalities.
The first experiment of this study aimed at elucidating the role
of crossmodal interactions in the perception of motion in
space. Moving audio-visual stimuli and moving unimodal
stimuli (acoustic or visual), traveling along different prede-
fined trajectories at a stimulus duration of either 0.5 s or 2.0 s,
were randomly presented in the free-field (Fig. 1A). Audio-
visual stimuli were either congruent or spatially and tempo-
rally disparate at motion offset (Figs. 1B and C). In a blocked
design, subjects were instructed to selectively attend to either
the acoustic or the visual stream and to localize the final
position of the attended stimulus. At central motion offset
locations, we expected a shift of the perceived final position of
acoustic trajectories towards the displaced final position of
visual trajectories. However, as positional accuracy in visual
motion perception decreases towards the peripheral field
(McKee and Nakayama, 1984) and thus the reliability of visual
signals, we hypothesized that auditory motion perception
would be less influenced by concurrent visual information at
lateral motion offset locations.
In a second experiment, we studied multisensory integra-
tion of motion stimuli. Using the same audio-visual stimuli as
in the first experiment, subjects were instructed to simulta-
neously attend to both streams and to judge after each trial
whether the visual and the acoustic motion offsets were
congruent or incongruent. We hypothesized that increasing
spatio-temporal disparities between the visual and the
acoustic motion offset would lead to increased rates of
separation of both streams. Furthermore, we predicted that
the rate of perceived unity in conflicting trials is reduced at
central motion offset locations due to enhanced stimulus
reliability in both modalities.
2. Results
2.1. Experiment 1: audio-visual localization task
Visual and auditory localization performances in the selective
attention task were contrasted with each other to evaluate the
Fig. 1 – Experimental setup and stimulus conditions. (A) Array of 47 loudspeakers and 188 LEDs (dots in front of the
loudspeaker symbols) mounted in an azimuthal, semicircular arrangement at a distance of 2.35 m from the subject's head.
Locations in the left and right hemifield are denoted by negative and positive signs, respectively. Arrows indicate the
trajectories of motion stimuli which move towards the target positions from either side. Motion offset locations in the constant
modality are shown by blue loudspeaker symbols and yellow LED symbols. Motion offset locations of the concurrent signal in
the second modality (not indicated) during bimodal stimulation varied with respect to a predefined spatio-temporal disparity
(panel B and C). (B) Uppermost row: Unimodal stimuli (left—acoustic only; right—visual only) traveling an angular range of 38°.
Left side: ‘acoustic constant’, right side: ‘visual constant’. In the audio-visual localization task (experiment 1) two different
attentional conditions were tested. The attended signal traveled a constant angular range (38°), while the unattended signal,
which concurrently started from the same position and traveled with the same angular speed, stopped before, simultaneously
or after motion offset of the signal in the attended modality. Spatial disparities between the attended and unattended signals
were systematically varied between −15° and +15° in steps of 5°. Note that spatial disparities at motion offset co-varied with
temporal disparities (panel C). The same set of stimuli, with the exception of the unimodal stimulus conditions, was also
employed in the task on detecting spatio-temporal disparities (experiment 2). (C) Relationship of spatial and temporal
disparities at motion offset between signals in the constant and the variable modality. The employed spatial disparities were
identical for 0.5 s and 2.0 s signals, but temporal disparities co-varied both with spatial disparity and signal duration.
influence of concurrent visual streams on auditory motion
perception and vice versa. In Fig. 2 localization performance at
different motion offset locations in space is plotted as a
function of spatial disparity between the attended and
unattended modality. For reasons of clarity, the respective
temporal disparities are not indicated in the following figures.
Note, however, that temporal offset disparities varied both
as a function of spatial disparity and overall signal duration
(Fig. 1C, Table 1). Mean values are displayed for normalized
data, i.e. relative to the respective unimodal localization
performance. For the statistical comparison between visual
and auditory localization performance, data of the two
stimulus durations (0.5 s, 2.0 s) were submitted to separate
four-way repeated measures analysis of variance (ANOVA),
including the within-subject factors (i) attended modality
(‘attend auditory’,‘attend visual’), (ii) spatial location of
motion offset (8°, 38°, 60°), (iii) motion direction (towards
the periphery, towards the midline) and (iv) spatio-temporal disparity (for 0.5 s signal duration: −15° (−195 ms), −10° (−130 ms), −5° (−65 ms), 0° (0 ms), +5° (+65 ms), +10° (+130 ms), +15° (+195 ms); for 2.0 s signal duration: −15° (−780 ms), −10° (−520 ms), −5° (−260 ms), 0° (0 ms), +5° (+260 ms), +10° (+520 ms), +15° (+780 ms)).
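The normalization used here (performance relative to the unimodal control) amounts to a simple difference of means. The sketch below uses hypothetical response data, and the sign convention (positive = shift towards the unattended offset) is an assumption for illustration.

```python
import statistics

def localization_bias(bimodal_responses_deg, unimodal_responses_deg):
    """Mean perceived final position in the bimodal condition minus the
    mean for the matching unimodal control stimulus (hypothetical data;
    positive values here mean a shift towards the unattended offset)."""
    return (statistics.mean(bimodal_responses_deg)
            - statistics.mean(unimodal_responses_deg))

# Attended acoustic offset at +8 deg, concurrent visual offset displaced
# by +10 deg (hypothetical perceived positions, in degrees):
bimodal = [10.5, 11.0, 9.5, 12.0]
unimodal = [8.0, 8.5, 7.5, 8.0]
print(localization_bias(bimodal, unimodal))  # prints 2.75
```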
2.1.1. Signal duration 0.5 s
Localization performance in both the ‘attend auditory’ and the ‘attend visual’ condition was affected by concurrent
motion streams in the respective unattended modality
(Fig. 2A). Auditory motion perception mainly varied as a
function of the displaced visual stimulus at central motion
offset locations. The mean values indicate that the magnitude
of localization bias towards the respective final position of
visual trajectories increased with increasing spatio-temporal
disparities. This trend was confirmed by a highly significant main effect of spatio-temporal disparity (F(6,150) = 25.55; p < 0.001, η² = 0.505). The localization biases were noticeably reduced, but still significant, when trajectories with the same disparities between the visual and the acoustic signal terminated at paracentral (F(6,150) = 9.63; p < 0.001, η² = 0.278) and lateral motion offset locations (F(6,150) = 2.51; p = 0.044, η² = 0.091). Opposite effects were observed for the ‘attend visual’ condition. Concurrent acoustic streams had the strongest influence on visual motion perception at lateral motion offset locations (F(6,150) = 53.12; p < 0.001, η² = 0.680). That is, the perceived final positions of visual trajectories were shifted towards the respective final positions of acoustic trajectories. This biasing effect was reduced, though still significant at paracentral (F(6,150) = 29.58;
Fig. 2 – Mean localization bias in the attended modality. Blue symbols: ‘attend auditory’; yellow symbols: ‘attend visual’. Localization performance is given for central, paracentral and lateral motion offset locations (data are collapsed across both hemifields). The respective trajectories are indicated on top of the graph. Red circles denote motion offset locations in the attended modality. Normalized data are plotted as a function of spatial disparity between the acoustic and visual motion offset. Note that temporal disparities are not indicated in the graph. The dotted grey lines refer to the performance in localizing the respective unimodal control stimulus. Mean values deviating from zero indicate a localization bias towards the non-attended modality. Lower and upper error bars represent the 25th and the 75th percentile, respectively. (A) 0.5 s signal duration in the attended modality. (B) 2.0 s signal duration in the attended modality.
p < 0.001, η² = 0.542) and central motion offset locations (F(6,150) = 7.76; p < 0.001, η² = 0.237).
The comparison of visual and auditory localization performance in a four-way repeated measures ANOVA revealed a main effect of spatio-temporal disparity (F(6,150) = 82.99; p < 0.001, η² = 0.768). Localization biases in the attended modality were correlated with spatio-temporal disparities, i.e. the largest negative shifts were observed for disparities of −15° (−195 ms), whereas the largest positive shifts occurred for disparities of +15° (+195 ms). The overall biasing effect of concurrent acoustic streams on visual motion perception was larger than the reverse effect for disparities of −10° (−130 ms), +10° (+130 ms) and +15° (+195 ms), which is confirmed by an interaction of attended modality and spatio-temporal disparity (F(6,150) = 2.70; p = 0.038, η² = 0.097). The inverted pattern of mutual influences between modalities at central and lateral motion offset locations is reflected in the significant interaction of attended modality, spatial location of motion offset and spatio-temporal disparity (F(12,300) = 8.31; p < 0.001, η² = 0.249). When subjects indicated the
final position of moving acoustic signals that terminated at
central locations, the magnitude of localization bias varied as a
function of the displaced visual motion offset, whereas the
localization of visual motion offsets in the central field was much
less affected by the displaced acoustic motion offset. However, at
lateral motion offset locations, it was visual motion perception
that was biased in the presence of concurrent acoustic streams
and the magnitude of the respective localization bias depended
on the displaced acoustic motion offset. Concurrent visual
streams, in contrast, were not as effective in biasing the perceived
final position of acoustic trajectories in the periphery. Finally, at
paracentral motion offset locations, comparable biasing effects
towards the unattended modality were observed between the
‘attend auditory’ and ‘attend visual’ conditions.
2.1.2. Signal duration 2.0 s
For longer signal durations, localization performance in the
target modality was less affected or even unaffected by
concurrent motion streams in the unattended modality
(Fig. 2B). Specifically, mean values in the ‘attend visual’
condition were largely comparable to unimodal localization
performance at each spatial location of motion offset. This
indicates that concurrent acoustic motion streams hardly
affected the perceived final position of visual motion. Biasing
effects were observed in the ‘attend auditory’ condition at paracentral (F(6,150) = 5.73; p = 0.001, η² = 0.186) and lateral motion offset locations (F(6,150) = 6.55; p < 0.001, η² = 0.208). Surprisingly, the localization biases were towards the opposite direction (i.e. away from the final position of visual motion).
The comparison of visual and auditory localization performance in a four-way repeated measures ANOVA revealed a main effect of attended modality (F(1,25) = 5.75; p = 0.024, η² = 0.187), which was due to the biasing effects in the ‘attend auditory’ condition. More pronounced biasing effects were observed for signals moving towards the periphery (F(1,25) = 10.19; p = 0.004, η² = 0.290). Furthermore, the magnitude of localization bias depended on the spatio-temporal disparity (F(6,150) = 12.72; p < 0.001, η² = 0.337), i.e. localization of the acoustic signal was increasingly biased away from the location of the visual motion offset with increasing spatio-temporal disparities. The interaction of attended modality and spatio-temporal disparity (F(6,150) = 7.25; p < 0.001, η² = 0.225) is explained by the fact that the magnitude of the localization bias towards the opposite direction varied as a function of spatio-temporal disparity, particularly in the ‘attend auditory’ condition. Additionally, the significant interaction of spatial location of motion offset and spatio-temporal disparity (F(12,300) = 3.15; p = 0.004, η² = 0.112) is due to the fact that localization performance varied as a function of spatio-temporal disparity at paracentral and lateral motion offset locations, but was not affected at central motion offset locations.
2.1.3. Comparing localization performance for 0.5 s and 2.0 s
signals
The employed spatial disparities at motion offset were
identical for short and long signal durations. However, the
presentation of the same spatial offset disparity in signals
that differ in their overall duration implicated a different
temporal offset disparity between the attended and unat-
tended signal, a feature inherent to the variation in the
duration of moving signals. This temporal variation led to
notable differences in localization performance between both
signal durations. Effects of unattended motion streams on
localization performance in the attended modality were
predominantly observed for signals at a duration of 0.5 s, in
which spatial disparities in steps of 5° between −15° and +15°
co-varied with temporal disparities in steps of 65 ms between
−195 ms and + 195 ms. However, the covariation of spatial and
temporal disparities differed in signals at a duration of 2.0 s.
Spatial disparities between −15° and +15° were accompanied
by larger temporal disparities in steps of 260 ms between
−780 ms and + 780 ms, resulting in a considerably reduced
influenceof the unattended signal on localization performance
in the target modality.
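This co-variation follows directly from the stimulus geometry: both signals move at the same angular speed, and the attended signal covers 38° within the signal duration, so each spatial offset disparity maps onto a proportional temporal one. A minimal sketch of that mapping (the paper's reported steps of roughly 65 ms and 260 ms per 5° closely match this proportionality, suggesting the published values were rounded to convenient steps):

```python
def temporal_disparity_ms(spatial_disparity_deg, signal_duration_s,
                          angular_range_deg=38.0):
    """Temporal offset disparity implied by a spatial offset disparity
    when both signals travel at the same angular speed (the attended
    signal covers angular_range_deg within signal_duration_s)."""
    angular_speed = angular_range_deg / signal_duration_s  # deg/s
    return spatial_disparity_deg / angular_speed * 1000.0  # ms

# Quadrupling the signal duration quarters the speed and therefore
# quadruples the temporal disparity implied by the same spatial offset.
for duration in (0.5, 2.0):
    steps = [round(temporal_disparity_ms(d, duration)) for d in (5, 10, 15)]
    print(duration, "s:", steps)
```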
2.2. Experiment 2: integration and separation of audio-visual motion streams
Reports of perceived unity or separation of moving audio-
visual stimuli (Fig. 3) were classified into four groups,
according to the respective constant signal (acoustic or visual)
and stimulus duration (0.5 s or 2.0 s). Statistical analysis was
based on separate log-linear models for the data of the two
stimulus durations. The included variables were the same as
in experiment 1 (see Section 2.1).
2.2.1. Signal duration 0.5 s
Subjects predominantly reported perceived unity for congru-
ent audio-visual signals or for signals with spatio-temporal
disparities of −/+5° (−/+ 65 ms) between the acoustic and the
visual motion offset (Fig. 3A).
With increasing spatio-temporal disparities, the probability to integrate both signals constantly decreased, as reflected in a one-way association of spatio-temporal disparity (χ²(0.95,6) = 101.7; p < 0.001). A two-way association of spatio-temporal disparity and constant modality (χ²(0.95,6) = 60.90; p < 0.001) is explained by an increased probability to separate signals when the visual trajectory exceeded the acoustic trajectory in terms of motion path and duration. That is, for the constant acoustic trajectory, the percentage of reported unity, in particular for spatial disparities of +10° (+130 ms) and +15°
(+195 ms) (visual trajectory longer than the acoustic trajectory),
was considerably smaller compared to spatio-temporal dispar-
ities of −10° (−130 ms) and −15° (−195 ms) (visual trajectory
shorter than the acoustic trajectory). Likewise, when the visual
trajectory was constant, the probability to integrate both signals
was lower for spatio-temporal disparities of −10° (−130 ms) and
−15° (−195 ms) (visual trajectory longer than the acoustic
trajectory) compared to spatio-temporal disparities of +10°
(+130 ms) and +15° (+ 195 ms) (visual trajectory shorter than
the acoustic trajectory). No one-way associations or higher-
order associations were found for the factors spatial location of
motion offset or motion direction. This indicates that the ratios
of integration and separation are similar for stimulus combina-
tions with identical spatio-temporal disparities, but traveling
along different trajectories.
2.2.2. Signal duration 2.0 s
Reports of perceived unity were mainly observed for congru-
ent trials. A lower percentage of audio-visual signals was
integrated at spatio-temporal disparities of −/+5° (−/+260 ms)
and signals were almost exclusively perceived as separated
for spatio-temporal disparities of −/+10° (−/+520 ms) and −/+
15° (−/+780 ms) (Fig. 3B). A one-way association of spatio-temporal disparity (χ²(0.95,6) = 858.1; p < 0.001) confirms the
increased probability of stimulus separation with increasing
disparities between both streams. Furthermore, the percent-
age of reported integration was lower when the moving visual
signal's termination point exceeded its acoustic counterpart.
For example, for trials in which the acoustic trajectory was
constant, the percentage of reported signal unity was about
70% when the visual trajectory was shorter than the acoustic
trajectory (−5°, −260 ms), but considerably lower (about 20%)
when the visual trajectory exceeded the acoustic trajectory by
+5° (+260 ms). Response behavior was similar for trials in
which the visual trajectory was constant. This finding was
confirmed by a two-way association of spatio-temporal
disparity and constant modality (χ²(0.95,6) = 100.19; p < 0.001).
Again, there were no one-way associations or higher-order
associations including the factors spatial location of motion
offset or motion direction, indicating that stimulus combina-
tions with identical spatio-temporal disparities were equally
likely to be integrated regardless of the traveled trajectory.
2.2.3. Comparing discordance detection for 0.5 s and 2.0 s
signals
The rate of perceived unity showed a pronounced asymmetry
between short (Fig. 3A) and long signal durations (Fig. 3B). As
described for experiment 1, one has to take into account that
temporal disparities at motion offset covary both with spatial disparity and overall signal duration, which notably influenced performance in detecting incongruent trials. The same spatial disparities ranging between −15° and +15° were accompanied by much larger temporal disparities in 2.0 s signals, which is reflected in overall lower probabilities to integrate both signals.
[Fig. 3 shows panels (A) 0.5 seconds and (B) 2.0 seconds, each with ‘acoustic constant’ and ‘visual constant’ conditions; ordinate: percentage reports of unity (0–100%); abscissa: spatial disparity (−15° to +15°); legend trajectories: 47° to 8°, 30° to 8°, 77° to 38°, 0° to 38°, 90° to 60°, 22° to 60°.]
Fig. 3 – Percentage of reports of perceived unity in the constant modality. Percentages are given for each trajectory and each spatial disparity between the acoustic and visual motion offset (data are collapsed across both hemifields). Note that temporal disparities are not indicated in the graph. Different colors indicate different trajectories as illustrated on top of the graph and specified in the legend. Black circles denote motion offset locations in the constant modality. (A) 0.5 s signal duration in the constant modality. (B) 2.0 s signal duration in the constant modality.
Still, there were two general principles that applied to both
signal durations: when the visual trajectory exceeded the
acoustic trajectory, the probability of perceiving separate
signals was increased compared to signals in which the
acoustic trajectory exceeded the visual trajectory. Furthermore,
the proportion of perceived unity and separation for identical
spatio-temporal disparities was not affected by the traveled
trajectory of the audio-visual stimulus.
3. Discussion
The goal of the current experiments was to disentangle the
contributions of crossmodal interactions and multisensory
integration to the perception of audio-visual motion in space.
So far, spatial effects have not been investigated in detail. In
the present study, the location of the trajectories’ termination point (central vs. lateral) has proven to be a crucial factor for the crossmodal interplay between the visual and the auditory system. However, the location of the trajectories’ termination
point did not influence the ratio of perceived unity and
separation of audio-visual motion streams.
We found that, in particular for short signal durations, the
magnitude of crossmodal localization bias towards the
location of motion offset in the non-attended modality varied
as a function of spatio-temporal disparity between the visual
and acoustic motion offset. However, when the task required
a decision on whether the audio-visual stimuli were spatio-
temporally aligned or not, the percentage of reported unity
decreased with increasing spatio-temporal disparities. This
indicates that crossmodal interactions do not depend on the
integration of the audio-visual motion stream. Moreover, the
results suggest that subjects benefited jointly from spatial and
temporal cues in the perception of audio-visual motion and
that temporal disparities at motion offset likely explain the
observed asymmetries in performance between short and
long signal durations in both experiments.
3.1. Crossmodal interactions between the visual and the
auditory system depend on the trajectory in space
The findings of the audio-visual localization task indicate
mutual crossmodal interactions between the visual and the
auditory system in the perception of motion. Here, the influence
of the unattended signal on localization performance
in the attended modality crucially depended on the specific
trajectory as well as on the spatio-temporal disparity between
the visual and the acoustic motion offset. Because the stimuli
were initially congruent, the modality determining the percept
in conflicting trials could only emerge during the final part of
the moving stimulus. Crossmodal interactions between
the visual and the auditory modality were mainly observed for
short stimulus durations (0.5 s), but were found to be much
weaker for longer signal durations (2.0 s). Even though the
predefined spatial conflicts were identical for both signal
durations (−/+5°, −/+10°, −/+15°), moving audio-visual stimuli
differed with respect to temporal offset disparities (65 ms to
195 ms in short signals and 260 ms to 780 ms in long signals),
which likely explains the discrepancy in localization perfor-
mance between short and long signal durations. For 0.5 s
signals, visual motion perception was most susceptible to
influence by concurrent acoustic streams at lateral motion
offset locations, whereas auditory motion perception was
biased towards visual information mainly at central motion
offset locations. The results do not support the notion of a
general asymmetry between modalities in the perception of
motion (Soto-Faraco et al., 2004), i.e. a visual dominance, but
rather highlight the crucial role of the auditory modality when
the reliability of the visual stimulus is reduced (Alais and Burr,
2004; Hidaka et al., 2011). Our results might therefore reflect
differential abilities between the visual and the auditory system
when attempting to accurately track dynamically changing
location cues along different trajectories in space. The process-
ing of visual motion is most accurate in foveal regions, and this
accuracy in detecting positional changes decreases remarkably
in the peripheral visual field (Finlay, 1982; McKee and
Nakayama, 1984; To et al., 2011). Studies on spatial acuity in
the auditory modality have shown that the minimum auditory
movement angle (MAMA) is smallest in the central field
(Grantham, 1986). However, Perrott et al. (1987, 1990) pointed
out that the resolution of the auditory system across space is
roughly homogeneous in relation to the visual resolution. That
is, optimal visual resolution is only possible in the fovea,
whereas the acuity in perceiving sound sources is much more
constant across space. This crucial discrepancy between both
modalities is likely the key underlying feature that accounts for
the results of the present study. On the one hand, our results
show that concurrent visual motion streams ‘capture’ dynamic
auditory information at central motion offset locations, thus
confirming the results of previous studies on audio-visual
motion perception (e.g. Allen and Kolers, 1981; Mateeff et al.,
1985; Soto-Faraco et al., 2002; Strybel and Vatakis, 2004). Visual
motion perception, in contrast, was only slightly biased towards
the final position of the acoustic trajectory in the central field.
This suggests that foveal tracking of motion features in the
visual system is more precise than dynamic tracking of
interaural differences at midline positions in the auditory
system. Indeed, during dynamic visual capture, Alink et al.
(2008) demonstrated reduced activity in the auditory motion
complex (AMC) and simultaneously enhanced activity in the
human middle temporal area. The authors concluded that
altered activity in early auditory motion areas reflects a
neuronal correlate of visual dominance.
On the other hand, our results also demonstrate that
concurrent acoustic streams ‘capture’ visual motion informa-
tion at lateral motion offset locations, whereas concurrent
visual streams were not as effective in biasing auditory motion
perception in the periphery. We hypothesize that superior
peripheral tracking of dynamically changing location cues in
the auditory system accounts for the observed biasing effects
on visual motion perception. This is in line with the notion that
spatial acuity in the periphery declines more steeply in the visual
system than in the auditory system, as previously
proposed by Perrott et al. (1987, 1990). Concerning the study by
Alink et al. (2008), the question arises whether an inverted
activation shift between auditory and visual motion areas
would be observed, once the auditory modality dominates
perception. That is, enhanced activity in the respective cortical
auditory areas could directly affect motion processing in visual
areas. The hypothesisof a gradual shift in the activation pattern
is also supported by the fact that the influence of the respective
unattended modality on visual or acoustic localization perfor-
mance was equivalent at paracentral motion offset locations in
the present study. Thus, the visual and the auditory system
seem to be equal in their capabilities to track dynamic
information at intermediate spatial locations.
For 2.0 s signals, crossmodal interactions were only found
for the ‘attend-auditory’ condition. The biasing effects were
much smaller and, interestingly, negatively correlated with
spatio-temporal disparity. Wallace et al. (2004) also found a
negative localization bias for static audio-visual events that
were either spatially or temporally disparate. In the present
study, concurrent acoustic and visual signals moved in the
same direction and differed only with regard to spatial and
temporal motion offsets. Specifically, in the case of prolonged
stimulus durations, we assume that multisensory predictions
about the trajectory of an ongoing motion are established in
the respective brain areas which are then violated when one
signal ceases abruptly. Van Wanrooij et al. (2010) demonstrat-
ed that multisensory expectations based on the spatial
congruence or incongruence of audio-visual stimuli of previ-
ous trials have an influence on reaction times and accuracy in
orienting towards the current audio-visual stimulus. Further-
more, the violation of multisensory expectations has been
shown to affect the dynamics of oscillatory activity in the
brain and reflects the detection of an intermodal conflict
(Arnal et al., 2011), which in turn might interfere with our
ability to properly localize motion stimuli. In the current
study, however, the detection of a discrepancy between both
streams seems to have an impact on auditory, but not on
visual localization performance. However, it is debatable
whether the consequence of a putative prediction error can
be interpreted in terms of a crossmodal localization bias or
rather reflects a distinct phenomenon. Additional research is
needed to clarify this aspect.
3.2. Spatio-temporal analysis is crucial for the integration
and separation of audio-visual motion streams
In the second experiment we found that the performance in
detecting conflicting audio-visual trials improved with increasing
spatio-temporal disparities between the acoustic and the visual
motion offset. The results are therefore in line with previous
studies using stationary audio-visual events (Bertelson and
Radeau, 1981; Lewald and Guski, 2003; Slutsky and Recanzone,
2001; Wallace et al., 2004). As conflicting trials only differed at
motion offset, one can assume that the decision regarding the
integration or separation of both streams depended on the final
part of the audio-visual stimulus. Furthermore, we demonstrat-
ed the crucial role of the co-analysis of spatial and temporal
features in integrating multisensory information. Though the
probability of stimulus separation generally increased with
increasing spatial disparities both for 0.5 s and 2.0 s signals,
subjects’ performance showed pronounced asymmetries with
respect to the absolute rate of stimulus separation. Inherent to
the prolongation of the overall signal duration were larger
temporal offset disparities between the acoustic and visual
stream in 2.0 s signals which likely account for the observed
differences. It can be assumed that the temporal disparities in
2.0 s signals did not fall within the temporal window of
integration (Colonius and Diederich, 2010). Consequently,
these signals were more likely to be perceived as separated
compared to those at a duration of 0.5 s.
Unexpectedly, the proportion of perceived unity and
separation for stimulus combinations with identical spatio-
temporal disparities was comparable across trajectories. This
finding applied to both signal durations. It can be reasoned
that subjects may have additionally relied on a comparison of
temporal offsets between the visual and acoustic signal. Thus,
the combined use of spatial and temporal cues seems to
provide redundant information at each spatial location of
motion offset.
Another important finding was the role of the temporal
order of visual and acoustic signal offsets for perceived signal
unity. Visual motion offsets prior to acoustic motion offsets
resulted in a higher percentage of reports of perceived unity
than vice versa. In other words, segregation of both streams
occurred more frequently when the visual trajectory exceeded
the acoustic trajectory. This finding is likely the direct result
of the fact that light travels more quickly than sound (King,
2005; Recanzone, 2009), i.e. the visual information is assumed
to reach the sensory receptors earlier than the associated
acoustic information. Therefore, events violating this fact, e.g.
when sound precedes light or when the visual trajectory
exceeds the acoustic trajectory, are not considered to origi-
nate from the same object (Morein-Zamir et al., 2003) and thus
the streams tend to be separated.
3.3. Crossmodal interactions and multisensory integration
in motion perception are independent processes
Crossmodal interactions between the visual and the auditory
system were predominantly observed for short signal dura-
tions. Depending on the specific traveled trajectory in space,
both the perceived auditory and visual motion information
were subject to a ‘capture’ by the unattended modality.
Thereby, the magnitude of localization bias varied as a
function of spatio-temporal disparity between the visual and
the acoustic motion offset. Biasing effects were still observed
for the largest spatio-temporal disparities of −/+15°
(−/+195 ms). In contrast, a reverse trend was observed for the
subjects’ reports of perceived integration or separation of both
sensory streams. Increasing spatio-temporal disparities led to
more reliable separations. Thus, it can be concluded that
integration of both streams is not a necessary prerequisite for
a perceptual bias of information towards the irrelevant
modality as previously suggested by Bertelson and Radeau
(1981). The results of the present study stand in contrast to
those of Wallace et al. (2004), where subjects were presented
with congruent and conflicting stationary audio-visual
events. The subjects were instructed to localize the position
of the acoustic signal and to indicate at the same time
whether the visual and the acoustic signal were perceived as
unified or separated. Hence, Wallace et al. (2004) did not
separately investigate crossmodal interactions and multisen-
sory integration. They showed that perceived unity was
correlated with localization of the stationary auditory signal
at or very near the location of the stationary visual signal. This
is not surprising, given the fact that subjects may have relied
on the non-ignored visual signal instead of using auditory
cues for localization. However, the assumption of unity
implies integration into a coherent percept whereby different
streams of information are perceived as emanating from the
same spatial location. Localization in a selective attention
task, in contrast, rather reflects crossmodal interactions
between senses without necessarily assuming that various
streams belong to the same object. In the present study,
localization biases towards the non-attended signal were
observed at intermediate positions between the distance of
the visual and the acoustic motion offset location. Thus, the
results support the existence of genuine localization biases
and do not reflect pure decisional strategies.
Furthermore, our data provide clear evidence for a co-
analysis of spatial and temporal cues in the perception of
bimodal motion streams. Though one cannot dissociate the
respective contribution of spatial and temporal cues to
behavioral performance for a given spatio-temporal disparity,
the comparison between short and long signal durations
revealed that multisensory interactions only occur when both
cues are provided within a specific range of tolerance. Both
interaction and integration effects were much weaker for 2.0 s
signals. Although these signals were presented with identical
predefined spatial disparities as those at a duration of 0.5 s,
temporal disparities were much larger and obviously exceeded
the temporal window for binding of various streams of informa-
tion. Temporal disparities are therefore likely to explain the
observed asymmetries both in the magnitude of localization bias
and the rate of perceived unity between 0.5 s and 2.0 s signals.
Previously, audio-visual interactions in motion perception
have mainly been studied in tasks in which subjects were
asked to discriminate the direction of motion in one modality
while ignoring a crossmodal distractor in another modality
(Allen and Kolers, 1981; Mateeff et al., 1985; Soto-Faraco et al.,
2002; Strybel and Vatakis, 2004). In the present study, cross-
modal interactions and multisensory integration were studied
with concurrently presented acoustic and visual signals that
moved in identical directions but were conflicting at motion
offset. The results support the notion that the visual modality
dominates motion perception in the central field which is
most likely due to superior tracking of positional information
at the midline (Soto-Faraco et al., 2004). However, tracking of
dynamically changing location cues in the periphery seems to
be more accurate in the auditory system as concurrent
acoustic motion streams biased visual motion perception at
lateral motion offset locations. The results of the current
study confirm that degraded reliability of the visual signal in
motion perception can be compensated for by the auditory
modality as previously proposed by Hidaka et al. (2011).
Taken together, the present findings indicate that different
trajectories in space alter the perceived quality of moving
visual and moving acoustic signals and thus their suscepti-
bility to capture by the other modality. The crossmodal
localization bias towards the location of the final position in
the non-attended modality still occurs even when the moving
audio-visual signal would be perceived as separated. This
finding implicates a more restricted range of tolerance for
spatial and/or temporal disparities in multisensory integra-
tion compared to crossmodal interactions, an aspect that
needs to be further considered in future studies investigating
multisensory interactions.
4. Conclusion
Our data provide evidence that the interplay of the visual and
the auditory modality in motion perception crucially depends
on the trajectory. The findings indicate that positional
information in the central field is more accurately tracked by
the visual system, since concurrent visual streams biased
auditory motion perception mainly at central motion offset
locations. The auditory system, in contrast, seems to be
superior to the visual system in tracking positional informa-
tion in the peripheral space as visual localization performance
was biased towards the final position of the acoustic
trajectory mainly at lateral motion offset locations. The
magnitude of localization bias thereby varied as a function
of the spatio-temporal disparity between the visual and the
acoustic stream. Importantly, the interplay between modali-
ties was only observed when temporal conflicts at motion
offset did not exceed a critical range. The results furthermore
suggest that crossmodal interactions occur independently
from the integration of the audio-visual motion stream.
5. Experimental procedure
5.1. Subjects
Twenty-six subjects (14 females, 12 males; 4 left-handed;
mean age: 26.4 years; age range 20–33 years) with normal or
corrected-to-normal vision and normal hearing abilities
participated in both experiment 1 and experiment 2. None of
the subjects reported any neurological disorder. All subjects
gave informed written consent and were compensated for
their participation. This study conformed to The Code of
Ethics of the World Medical Association and was approved by
the local Ethics Committee of the University of Leipzig.
5.2. Setup and stimuli
The experiments were conducted in an anechoic, sound
attenuated free-field laboratory (40 m², Industrial Acoustics
Company [IAC]). Forty-seven broad-band loudspeakers (Visaton,
FRS8 4) were mounted in an azimuthal, semicircular array
at ear level (Fig. 1A). A comfortable, fixed chair was positioned
in the middle of the semicircle at a constant distance of 2.35 m
from the loudspeakers such that subjects were aligned
straight ahead to the central speaker at 0°. The loudspeaker
array covered an azimuthal plane from −98° to the left to +98°
to the right. The angular distance between two loudspeaker
membranes was 4.3°. Each loudspeaker was calibrated indi-
vidually. For this, the transmission spectrum was measured
using the Brüel & Kjær measuring amplifier (B&K 2610), a
microphone (B&K 2669, pre-amplifier B&K 4190) and a real-
time signal processor (RP 2.1, System3, Tucker Davis Technol-
ogies, TDT). For each loudspeaker a calibration file was
generated in Matlab 6.1 (The MathWorks Inc, Natick, USA)
and later used for presentation of acoustic stimuli with flat
spectra across the frequency range of the stimulus.
The speaker array was combined with an array of 188
white light emitting diodes (LED) mounted in azimuthal steps
of 1° at eye-level. The LEDs were controlled by a set of 51
printed circuit boards (PCB), which were interfaced with a
desktop PC. Each PCB was assembled with four infra-red (IR)
sensitive phototransistors for the registration of pointing
directions. The phototransistors were arranged with the
same angular distances as the LEDs, but extended beyond
the loudspeaker- and LED array by 8° to both the left and the
right. In combination with the IR-sensitive phototransistors,
the LED array was also used to provide visual feedback of the
angular position pointed to by the subjects. A customized IR-
torch served as pointing device (Solarforce L2 with 3W NVG
LED). The subtended angle of the IR-light beam covered a
maximum of 8° at the level of the LEDs. The mean position
across all activated IR-sensitive phototransistors was com-
puted online and the corresponding LED flashed up as a visual
feedback for the subject.
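The registration logic described above can be sketched as follows. This is a hypothetical Python illustration (the actual system ran on custom PCB hardware under MATLAB control); the function name and the degree-indexed sensor scheme are assumptions.

```python
import numpy as np

def feedback_led(activated_sensors):
    """Given the indices (in 1-degree steps) of all IR-sensitive
    phototransistors currently illuminated by the torch beam, return
    the index of the LED to flash as positional feedback."""
    if len(activated_sensors) == 0:
        return None  # torch beam is off the array
    # mean position across all activated sensors, snapped to the 1° LED grid
    return int(round(float(np.mean(activated_sensors))))
```

A beam covering several adjacent sensors thus always maps onto the single LED closest to its center, which is what makes the torch usable as a continuous pointer despite the discrete sensor grid.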
The loudspeakers and LEDs were hidden behind acousti-
cally transparent gauze, which did not affect visibility of the
LEDs. Thus, subjects were unable to make use of landmarks
during the localization and detection tasks. An infra-red
camera was installed in the test chamber to monitor subjects’
performance during the experimental sessions. Custom-
written MATLAB scripts (R2007b, The MathWorks Inc., Natick,
USA) were used to control stimulus presentation and data
acquisition. Visual and acoustic signals were digitally gener-
ated using RPvdsEx (Real Time Processor Visual Design Studio,
Tucker Davis Technologies, TDT) and delivered to two multi-
channel signal processors (RX8, System3, TDT).
Acoustic stimuli were low-frequency Gaussian noise bursts
(250–1000 Hz) that were presented at 40 dB SL (sensation level).
Sound localization in this low-frequency range is primarily
based on the processing of interaural time differences (ITDs).
Sound motion was simulated by successive activation of
adjacent loudspeakers. To obtain a continuous motion, the
ratio of sound intensity between two adjacent loudspeakers
was adjusted by linear cross-fading of the output voltage. The
level roving (variability in sound intensity around presentation
level) was set to −/+3 dB to avoid adaptation to loudness-related
localization cues. Visual stimuli were light spots at a luminance
of 2.5 lux. Moving visual signals were simulated by successive
activation of adjacent LEDs. The small distance of 1.0° between
two adjacent LEDs was sufficient to generate an apparent
motion percept.
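The cross-fading scheme for simulated sound motion can be sketched as follows. This Python fragment is an illustrative reconstruction, not the authors' implementation (which ran on TDT signal processors); the function name and the clamping behavior at the array edges are assumptions.

```python
import numpy as np

SPEAKER_SPACING = 4.3  # degrees between adjacent loudspeaker membranes (from the text)

def crossfade_gains(azimuth, speaker_positions):
    """Voltage gains for all loudspeakers for a virtual sound source at
    `azimuth`. Only the two speakers bracketing the source are active;
    their gains fade linearly with the fractional position between them,
    so the source glides continuously as `azimuth` is ramped."""
    speaker_positions = np.asarray(speaker_positions, dtype=float)
    gains = np.zeros(len(speaker_positions))
    # index of the last speaker at or before the source (clamped to the array)
    i = int(np.searchsorted(speaker_positions, azimuth, side="right")) - 1
    i = max(0, min(i, len(speaker_positions) - 2))
    frac = (azimuth - speaker_positions[i]) / (
        speaker_positions[i + 1] - speaker_positions[i])
    gains[i] = 1.0 - frac
    gains[i + 1] = frac
    return gains

# e.g. a source halfway between two speakers gets equal gains:
# crossfade_gains(2.15, np.arange(0.0, 13.0, SPEAKER_SPACING))
```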
Stimuli were unimodal (acoustic only/visual only) and
concurrent audio-visual motion streams. The constant signal
in moving audio-visual stimuli (attended modality in exper-
iment 1 or constant modality in experiment 2) traveled an
angular range of 38°. The speed of motion, however, varied
with the signal duration of the presented sequence (either
0.5 s or 2.0 s, including 10 ms rise and decay times). The final
positions of the constant signals were located at −8°, −38°,
−60°, +8°, +38°, and +60°, respectively (Fig. 1A). Motion offset
locations at −/+8° were defined as central, at −/+38° as
paracentral and at −/+60° as lateral. The concurrent signal in
the second modality finished either at a congruent location or
spatially and temporally displaced with respect to the final
position of the constant signal (Fig. 1B). Note that the
predefined spatial disparities (−/+5°, −/+10°, −/+15°) for 0.5 s
and 2.0 s signals were identical, but temporal disparities co-
varied both with spatial disparity and signal duration (Fig. 1C,
Table 1). Control stimuli (acoustic only or visual only) were
identical to the respective constant signal of audio-visual
signals in terms of trajectory and signal duration. To examine
the possible effect of motion direction on localization perfor-
mance, unimodal and audio-visual stimuli moved towards
the final position from either side (i.e. towards the midline
and towards the periphery, Fig. 1A).
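Because the constant signal always travels 38° within the fixed signal duration, each predefined spatial disparity implies a temporal disparity at motion offset. A minimal sketch of this co-variation, assuming constant speed along the trajectory; the computed values agree with Table 1 to within about 10 ms (the published values presumably reflect the discrete speaker and LED positions actually used).

```python
# Temporal offset disparity implied by a spatial offset disparity,
# given a constant-modality signal traveling 38 degrees in 0.5 s or
# 2.0 s (cf. Table 1). Approximate reconstruction, not the authors' code.

TRAJECTORY_DEG = 38.0

def temporal_disparity(spatial_disparity_deg, duration_s):
    speed = TRAJECTORY_DEG / duration_s       # degrees per second
    return spatial_disparity_deg / speed      # seconds

# e.g. temporal_disparity(5, 0.5) is ~0.066 s (table: 0.065 s),
# temporal_disparity(15, 2.0) is ~0.79 s (table: 0.78 s)
```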
5.3. Study design and procedure
Subjects were tested in complete darkness. Prior to experimen-
tal testing the detection threshold for moving sounds was
obtained for each subject to adjust the presentation level for
acoustic signals during the tests to 40 dB SL. Employing a heard/
not-heard paradigm, the subjects were asked to indicate by a
button press on a response box whether they detected an
acoustic signal (Gaussian noise bursts, 250–1000 Hz) moving
from −38° to 0° in the left hemifield. The initial sound level of
the moving stimulus was set to 63 dB SPL. When the subjects
detected the stimulus, sound level was decreased in steps of
2.5 dB. Otherwise, sound level was increased in equal step sizes.
When the subjects were confident about their individual hearing
threshold, i.e. the minimum sound pressure level which they
required to detect the moving stimulus, they confirmed their
decision by a button press. This reference value was used to set
the acoustic stimulus at 40 dB SL.
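The level-adjustment procedure can be sketched as follows; the simulated heard/not-heard responses stand in for the subject's button presses and are invented for illustration.

```python
# Sketch of the heard/not-heard level-adjustment procedure described
# above. The response sequence below is hypothetical example data.

START_LEVEL_DB = 63.0       # initial sound pressure level (dB SPL)
STEP_DB = 2.5               # fixed step size
SENSATION_LEVEL_DB = 40.0   # test stimuli presented at threshold + 40 dB

def track_threshold(responses):
    """Decrease the level after each 'heard' response and increase it
    after each 'not heard'; the level reached when the subject confirms
    is taken as the individual detection threshold."""
    level = START_LEVEL_DB
    for heard in responses:
        level += -STEP_DB if heard else STEP_DB
    return level

threshold = track_threshold([True, True, True, False, True])
presentation_level = threshold + SENSATION_LEVEL_DB  # level used in the experiments
```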
Table 1 – Spatial and temporal disparities at motion offset. The indications refer to disparities between the constant signal
(either a moving acoustic or a moving visual signal) and the corresponding variable signal of audio-visual stimuli as
presented in the audio-visual localization task (experiment 1) and the task on detecting spatio-temporal disparities
(experiment 2). Signal duration in the constant modality was either (A) 0.5 s or (B) 2.0 s. Note that the employed spatial
disparities were identical for both signal durations, but temporal disparities co-varied both with spatial disparity and signal
duration. In experiment 1 the indications for the constant modality and the variable modality correspond to the attended
modality and the non-attended modality, respectively.
(A) 0.5 s
Spatial disparity −15° −10° −5° 0° +5° +10° +15°
Signal duration constant modality 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s
Signal duration variable modality 0.305 s 0.37 s 0.435 s 0.5 s 0.565 s 0.63 s 0.695 s
Temporal disparity −0.195 s −0.13 s −0.065 s 0 s +0.065 s +0.13 s +0.195 s
(B) 2.0 s
Spatial disparity −15° −10° −5° 0° +5° +10° +15°
Signal duration constant modality 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s
Signal duration variable modality 1.22 s 1.48 s 1.74 s 2.0 s 2.26 s 2.52 s 2.78 s
Temporal disparity −0.78 s −0.52 s −0.26 s 0 s +0.26 s +0.52 s +0.78 s
Each subject participated in two experiments: (i) an audio-
visual localization task and (ii) a detection task in which
moving audio-visual stimuli had to be judged as spatio-
temporally congruent or incongruent.
Both experiments were divided into blocks. Short breaks
were allowed after completion of each block. Each block
started with the presentation of two stationary stimuli that
could be ignored by subjects. Five additional stationary stimuli
per block (acoustic, visual or audio-visual) were randomly
interspersed between moving stimuli to avoid adaptation to
motion. Subjects were instructed to look straight ahead
during the trials and not to pursue the moving signals with
either their eyes or their head. Unlike in a previous
free-field study in which subjects were asked to fixate an LED
during stimulus presentation (Hofbauer et al., 2004), no
fixation point was provided in the present study to exclude
the possibility that subjects could make use of it as a
reference for the midline position in the darkened test
chamber. The subjects’position was permanently monitored
by the experimenter via video stream from the test chamber.
Trials in which subjects moved their head were excluded
from data analysis.
5.3.1. Audio-visual localization task
Each subject completed a practice run consisting of 30 stimuli
to become familiar with the task and the infra-red pointing
device. During two test sessions, a total of 384 moving
stimulus combinations (336 audio-visual, 24 acoustic, 24
visual) were presented in 16 experimental blocks. An exper-
imental session consisted of 8 blocks. Each stimulus combi-
nation was presented twice. Stimuli within an experimental
block were presented in randomized order. Additionally, the
order of the 16 experimental blocks was counterbalanced
across subjects. In a blocked design, subjects were instructed
to selectively focus their attention on either the acoustic
(‘attend auditory’) or the visual (‘attend visual’) component of
the moving audio-visual stimuli. Instruction on the respective
to-be-attended-modality was given prior to each test block.
Moving unimodal stimuli within an experimental block were
only presented in the modality that corresponded to the
attended modality. Subjects were asked to indicate the
perceived final position of the moving targets in the attended
modality by pointing with an IR-torch. Visual feedback on the
indicated angular position was provided by flashing up the
corresponding LED. To confirm their response, subjects had to
release the button on the IR-torch whereby the designated
LED flashed three times and signaled successful registration
of the angular position. Responses were automatically stored
for subsequent data analyses. The next trial started after an
intertrial-interval (ITI) of 3.0 s, such that subjects were able to
re-orient towards the midline position. A response to each
moving target was required before the next trial could begin.
No feedback on the correct angular position was given at any
time of the test session. Fig. 4 illustrates the experimental
procedure.
5.3.2. Integration and separation of audio-visual motion streams
Each subject completed a practice run consisting of 20 stimuli to
become familiar with the task. Only audio-visual stimulus
combinations (the same as in the audio-visual localization task)
were presented in randomized order in eight experimental
blocks. The order of the eight blocks was counterbalanced across
subjects. Subjects were asked to simultaneously attend to the
acoustic and the visual part of the moving stimulus and to judge
whether motion offsets were spatio-temporally aligned or not
(same-different judgments). After each trial subjects indicated
their response by pressing the corresponding button on a
response box. The ITI between button press and the onset of
the next trial was 2.0 s. A response to each moving audio-visual
stimulus was required before the next trial could begin. No
feedback on the correct response was given at any time of the
test session.
5.4. Data analysis
Subjects’ performance did not differ between the two hemifields,
so data for each experiment were collapsed for the respective
trajectories in the left and right hemifield.
5.4.1. Audio-visual localization task
Localization bias in the attended modality was quantified as the
difference between the indicated final position and the actual
final position of motion for each stimulus combination. Addi-
tionally, data were normalized with respect to individual unim-
odal localization performance. Normalization compensated for
the overestimation of the final position which is inherent
to motion perception (Hubbard, 2005). Displaying normal-
ized data allowed for the direct comparison of the
magnitude of localization bias between different trajecto-
ries. Stimulus repetitions were averaged for each subject.
Fig. 4 –Experimental procedure. The symbols illustrate a sequence of different types of moving and stationary stimuli that
were presented in randomized order within an experimental block. The stimulus durations of signals in the attended modality
(audio-visual localization task) and in the constant modality (detection task on spatio-temporal disparities) are indicated below
the symbols. R = response (experiment 1: localization of the final position of motion in the attended modality using an infra-red
torch, experiment 2: button press in the task on detecting spatio-temporal disparities), ITI = intertrial-interval.
The results for the respective stimulus combinations were
grouped according to both the attended modality and the
stimulus duration. Statistical analyses were based on
normalized data. Data were submitted to a four-way
repeated measures ANOVA. The included within-subject
factors are specified in Section 2.1. Data of the two
stimulus durations (0.5 s, 2.0 s) were analyzed in separate
ANOVAs.
5.4.2. Integration and separation of audio-visual motion streams
Response behavior was evaluated with regard to the rate of
perceived unity for each stimulus combination. The results
for the respective stimulus combinations were grouped
according to both the constant modality and the stimulus
duration. Data (same/different responses) were nominal vari-
ables and were submitted to a log-linear analysis for
statistical analysis. Log-linear models predict the expected
frequency counts in a contingency table for a two- or more
factorial design. Differences between observed frequencies
and expected frequencies are expressed in one-way associa-
tions (main effects) and two-way and higher-order associa-
tions (interactions). The log-linear data analysis was based on
the number of subjects that reported perceived unity for a
given stimulus combination. Included variables were the
same as in the audio-visual localization task (see Section 2.1).
Separate analyses were conducted for the data of the two
stimulus durations (0.5 s and 2.0 s).
Acknowledgments
This work was supported by the Deutsche Forschungsge-
meinschaft (DFG), graduate program ‘Function of attention in
cognition’. We wish to thank two reviewers for their valuable comments and suggestions. We are also grateful to Ingo Kannetzky, Jörg Eckebrecht and Matthias Freier for technical assistance and to Patrice Voss for proofreading the manuscript.