Research Report
Crossmodal interactions and multisensory integration in the
perception of audio-visual motion —A free-field study
Kristina Schmiedchen ⁎, Claudia Freigang, Ines Nitsche, Rudolf Rübsamen
Faculty of Biosciences, Pharmacy and Psychology, University of Leipzig, Talstrasse 33, 04103 Leipzig, Germany
ARTICLE INFO

Article history:
Accepted 6 May 2012
Available online 14 May 2012

ABSTRACT
Motion perception can be altered by information received through multiple senses. So far,
the interplay between the visual and the auditory modality in peripheral motion perception
is scarcely described. The present free-field study investigated audio-visual motion
perception for different azimuthal trajectories in space. To disentangle effects related to
crossmodal interactions (the influence of one modality on signal processing in another
modality) and multisensory integration (binding of bimodal streams), we manipulated the
subjects’ attention in two experiments on a single set of moving audio-visual stimuli.
Acoustic and visual signals were either congruent or spatially and temporally disparate at
motion offset. (i) Crossmodal interactions were studied in a selective attention task.
Subjects were instructed to attend to either the acoustic or the visual stream and to indicate
the perceived final position of motion. (ii) Multisensory integration was studied in a divided
attention task in which subjects were asked to report whether they perceived unified or
separated audio-visual motion offsets. The results indicate that crossmodal interactions in
motion perception do not depend on the integration of the audio-visual stream.
Furthermore, in the crossmodal task, both visual and auditory motion perception were
susceptible to modulation by irrelevant streams, provided that temporal disparities did not
exceed a critical range. Concurrent visual streams modulated auditory motion perception in
the central field, whereas concurrent acoustic streams attracted visual motion information
in the periphery. Differential abilities between the visual and auditory system when
attempting to accurately track positional information along different trajectories account
for the observed biasing effects.
© 2012 Elsevier B.V. All rights reserved.
Keywords:
Audio-visual
Motion perception
Dynamic capture
Perceived unity
Trajectory
Attention
1. Introduction
Objects that surround us usually stimulate more than a single
sense. Various streams of information that belong to the same
object are integrated in the brain, while simultaneously those
streams that belong to different objects are separated. Over
the past decades, a wide range of studies dealing with the
perception of stationary events has led to a substantial
improvement of our understanding of how multisensory
inputs are combined into meaningful percepts (Calvert et al.,
2001; Ernst and Bülthoff, 2004; McGurk and MacDonald, 1976;
Senkowski et al., 2007; Teder-Sälejärvi et al., 2005; Werner and Noppeney, 2011).

BRAIN RESEARCH 1466 (2012) 99–111. doi:10.1016/j.brainres.2012.05.015
⁎ Corresponding author. Fax: +49 341 9736848. E-mail address: k.schmiedchen@uni-leipzig.de (K. Schmiedchen).

However, the specifics of the interplay, i.e.
the mutual influence between different senses in the percep-
tion of motion, are described in much less detail.
Motion perception requires the dynamic integration of spatial
changes over time. In vision, a moving light pattern consecu-
tively activates adjacent locations on the retina, which in turn
have to be associated by motion detectors in the visual cortex
(Albright and Stoner, 1995). The auditory system, in contrast, has
to track dynamically changing interaural time differences (ITDs)
and/or interaural intensity differences (IIDs) in order to infer
acoustic motion (Middlebrooks and Green, 1991). As such,
dynamic location coding fundamentally differs between the two modalities: a direct, retinotopically based representation in the visual system versus an indirect, reconstructed representation in the auditory system (Wilson and O'Neill, 1998).
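The ITD cue mentioned above can be illustrated with a spherical-head approximation. The sketch below uses Woodworth's formula, ITD = (r/c)(θ + sin θ), which is a standard textbook approximation and not taken from the present paper; the head radius and speed of sound are assumed values.

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound_m_s=343.0):
    """Approximate interaural time difference (ITD, in seconds) for a
    source at the given azimuth, using Woodworth's spherical-head
    formula: ITD = (r / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))

# A laterally moving sound sweeps through a range of ITDs that the
# auditory system must track dynamically to infer motion.
for azimuth in (0, 8, 38, 60):
    print(f"{azimuth:3d} deg -> {woodworth_itd(azimuth) * 1e6:6.1f} us")
```

The monotonic growth of ITD with azimuth is what makes a dynamically changing ITD a usable motion cue.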
A wide range of studies making use of various methodologies
such as single-cell recordings in animals, human behavior, EEG
and fMRI recordings, or statistical modeling, all seem to indicate
that the combination of multisensory inputs requires the
analysis of spatial and/or temporal features between different
streams of sensory information. Asynchronous inputs are linked
as long as they fall within a defined temporal window (Colonius
and Diederich, 2010; Lewald and Guski, 2003; Meredith et al.,
1987; Navarra et al., 2005; Slutsky and Recanzone, 2001).
Similarly, spatial disparities between acoustic and visual inputs
are tolerated to a certain degree in static events (Bertelson and
Radeau, 1981; Hairston et al., 2003; Thurlow and Jack, 1973; Welch
and Warren, 1980) and moving events (Alink et al., 2008; Meyer
and Wuerger, 2001; Soto-Faraco et al., 2002; Stekelenburg and
Vroomen, 2009). Other studies have highlighted the crucial role
of the co-analysis of spatial and temporal features in multisen-
sory perception (Bolognini et al., 2005; Lewald and Guski, 2003;
Meyer et al., 2005; Recanzone, 2009; Soto-Faraco et al., 2004; Stein
and Meredith, 1993).
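The windowed binding described above can be put into a toy rule; the window sizes below are illustrative assumptions, not values from the cited studies.

```python
def bind_streams(temporal_disparity_ms, spatial_disparity_deg,
                 temporal_window_ms=200.0, spatial_window_deg=15.0):
    """Toy rule: two sensory streams are bound into one event only if
    both the temporal and the spatial disparity fall inside their
    respective tolerance windows (illustrative window sizes)."""
    within_time = abs(temporal_disparity_ms) <= temporal_window_ms
    within_space = abs(spatial_disparity_deg) <= spatial_window_deg
    return within_time and within_space

print(bind_streams(65, 5))    # small disparities: bound
print(bind_streams(780, 15))  # large temporal disparity: segregated
```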
Importantly, multisensory integration refers to the binding
of stimuli perceived through multiple senses, whereas crossmodal interactions describe the direct influence of one modality
on signal processing in another modality without necessarily
integrating information (Spence et al., 2009). In particular, the
term crossmodal localization bias is used to describe the
displacement of the perceived location of a signal in the
attended modality towards the location of a concurrent, but
ignored signal in another modality (Bertelson et al., 2000;
Hairston et al., 2003).
A well described crossmodal interaction phenomenon is the
ventriloquist illusion. Ventriloquism refers to a bias of per-
ceived information towards the information of a ‘competing’
sense that has either a more precise spatial or temporal
resolution. The first studies on ventriloquism in motion
perception supported the notion of a visual dominance (Allen
and Kolers, 1981; Mateeff et al., 1985), as the visual system was
found to provide more salient cues in perceiving motion than
the auditory or the somatosensory system. Crossmodal dy-
namic capture of auditory motion was also demonstrated by
more recent studies (Kitajima and Yamashita, 1999; Sanabria et
al., 2007b; Soto-Faraco et al., 2002; Strybel and Vatakis, 2004),
where an alteration of the perceived direction of auditory
motion towards that of the visual signal was attributed to a
‘capture’ by the visual system which has superior spatial
resolution in the central field. In contrast, laterally moving
sounds were shown to induce motion perception of static visual
flashes (Hidaka et al., 2011), which demonstrates that the
auditory system can bias the visual system in the spatial
domain in instances where the visual resolution is inferior to
the auditory resolution. This finding is consistent with previous
studies employing stationary events (Alais and Burr, 2004;
Hairston et al., 2003). The unrestricted dominance of the visual
modality in motion perception was further challenged by
additional reports of modulatory effects of a moving sound on
the perceived direction of visual motion (Alink et al., 2012;
Brooks et al., 2007; Jain et al., 2008) or in resolving ambiguous
visual motion displays (Sanabria et al., 2007a). Also, the timing
of a stationary sound has been shown to affect the detection or
the perceived direction of moving visual signals (Freeman and
Driver, 2008; Getzmann, 2007; Kafaligonul and Stoner, 2010).
Taken together, the results of these studies suggest context-
dependent crossmodal interactions between the visual and the
auditory modality in motion perception and support the
hypothesis that perception is dominated by the modality
which provides the most reliable information (Battaglia et al.,
2003; Burr and Alais, 2006; Ernst and Banks, 2002).
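The reliability-based account cited above can be sketched as inverse-variance (maximum-likelihood) cue combination in the spirit of Ernst and Banks (2002); the numbers below are purely illustrative.

```python
def mle_combined_estimate(auditory_deg, auditory_var, visual_deg, visual_var):
    """Reliability-weighted combination of an auditory and a visual
    position estimate: each cue is weighted by its inverse variance,
    and the combined variance is lower than either single-cue variance."""
    w_auditory = (1.0 / auditory_var) / (1.0 / auditory_var + 1.0 / visual_var)
    combined = w_auditory * auditory_deg + (1.0 - w_auditory) * visual_deg
    combined_var = 1.0 / (1.0 / auditory_var + 1.0 / visual_var)
    return combined, combined_var

# Centrally, vision is precise (low variance) and dominates the percept;
# in the periphery its variance grows and audition pulls the estimate.
central = mle_combined_estimate(auditory_deg=10.0, auditory_var=4.0,
                                visual_deg=8.0, visual_var=1.0)
print(central)  # combined estimate lies close to the visual position
```

The same formula predicts the reversal reported in the periphery: once visual variance exceeds auditory variance, the weights flip and the sound dominates.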
Unlike crossmodal interactions, multisensory integration
necessarily implicates the binding of various streams of
information and yields a coherent percept. Ernst and
Bülthoff (2004) pointed out that the integration of multisen-
sory inputs reduces the variance between the individual
sensory estimates and leads to enhanced reliability of the
combined percept. Integration effects have been observed at
very distinct processing stages, from low-level structures
such as the superior colliculus (Meredith and Stein, 1983;
Meredith et al., 1987) to higher-order cortical areas such as
the superior temporal sulcus (Calvert et al., 2001; Werner and
Noppeney, 2010). Consequently, Meyer et al. (2011) suggested
that these different stages are responsible for the integration
of various contents of bimodal stimuli. That is, sensory
integration (e.g. spatial and temporal features) is likely
reflected in low-level structures, whereas semantic integra-
tion (e.g. linguistic contents) is assigned to higher-order
structures. Psychophysical studies investigating low-level
integration of motion features found improved speed estima-
tion (Wuerger et al., 2003) and an increase in the detection
rate (Meyer et al., 2005) in the bimodal condition. An fMRI
study by Lewis and Noppeney (2010) on higher-level integra-
tion revealed that audio-visual synchrony facilitates motion
discrimination. Furthermore, both behavioral and neurophys-
iological studies have shown that the integration of two or
more streams of information into a unified percept is optimal
for spatially and temporally aligned features (Stein and
Meredith, 1993; Wallace et al., 2004). In conflicting audio-
visual situations, the integration or separation of several
streams depends on the magnitude of the spatial and/or
temporal disparities, as previously suggested by the findings
of behavioral studies employing stationary events (Bertelson
and Radeau, 1981; Lewald and Guski, 2003; Slutsky and
Recanzone, 2001; Wallace et al., 2004).
A limitation of most of the previous studies is that moving
stimuli were mainly presented in the central visual field. To date,
interactions between the visual and the auditory system in
perceiving motion in the periphery—where the interplay between
the two senses might differ—have been seldom investigated.
The present free-field study aimed at better understanding
how crossmodal interactions and multisensory integration
contribute to audio-visual motion perception in central and
peripheral space. To this end, a single set of audio-visual
stimuli was employed in two different experimental tasks. The
subjects’ attention was either (i) selectively directed to a single
modality or (ii) simultaneously directed to both modalities.
The first experiment of this study aimed at elucidating the role
of crossmodal interactions in the perception of motion in
space. Moving audio-visual stimuli and moving unimodal
stimuli (acoustic or visual), traveling along different prede-
fined trajectories at a stimulus duration of either 0.5 s or 2.0 s,
were randomly presented in the free-field (Fig. 1A). Audio-
visual stimuli were either congruent or spatially and tempo-
rally disparate at motion offset (Figs. 1B and C). In a blocked
design, subjects were instructed to selectively attend to either
the acoustic or the visual stream and to localize the final
position of the attended stimulus. At central motion offset
locations, we expected a shift of the perceived final position of
acoustic trajectories towards the displaced final position of
visual trajectories. However, as positional accuracy in visual
motion perception decreases towards the peripheral field
(McKee and Nakayama, 1984) and thus the reliability of visual
signals, we hypothesized that auditory motion perception
would be less influenced by concurrent visual information at
lateral motion offset locations.
In a second experiment, we studied multisensory integra-
tion of motion stimuli. Using the same audio-visual stimuli as
in the first experiment, subjects were instructed to simulta-
neously attend to both streams and to judge after each trial
whether the visual and the acoustic motion offsets were
congruent or incongruent. We hypothesized that increasing
spatio-temporal disparities between the visual and the
acoustic motion offset would lead to increased rates of
separation of both streams. Furthermore, we predicted that
the rate of perceived unity in conflicting trials is reduced at
central motion offset locations due to enhanced stimulus
reliability in both modalities.
2. Results
2.1. Experiment 1: audio-visual localization task
Visual and auditory localization performances in the selective
attention task were contrasted with each other to evaluate the
Fig. 1 – Experimental setup and stimulus conditions. (A) Array of 47 loudspeakers and 188 LEDs (dots in front of the
loudspeaker symbols) mounted in an azimuthal, semicircular arrangement at a distance of 2.35 m from the subject's head.
Locations in the left and right hemifield are denoted by negative and positive signs, respectively. Arrows indicate the
trajectories of motion stimuli which move towards the target positions from either side. Motion offset locations in the constant
modality are shown by blue loudspeaker symbols and yellow LED symbols. Motion offset locations of the concurrent signal in
the second modality (not indicated) during bimodal stimulation varied with respect to a predefined spatio-temporal disparity
(panel B and C). (B) Uppermost row: Unimodal stimuli (left—acoustic only; right—visual only) traveling an angular range of 38°.
Left side: ‘acoustic constant’, right side: ‘visual constant’. In the audio-visual localization task (experiment 1) two different
attentional conditions were tested. The attended signal traveled a constant angular range (38°), while the unattended signal,
which concurrently started from the same position and traveled with the same angular speed, stopped before, simultaneously
or after motion offset of the signal in the attended modality. Spatial disparities between the attended and unattended signals
were systematically varied between −15° and +15° in steps of 5°. Note that spatial disparities at motion offset co-varied with
temporal disparities (panel C). The same set of stimuli, with the exception of the unimodal stimulus conditions, was also
employed in the task on detecting spatio-temporal disparities (experiment 2). (C) Relationship of spatial and temporal
disparities at motion offset between signals in the constant and the variable modality. The employed spatial disparities were
identical for 0.5 s and 2.0 s signals, but temporal disparities co-varied both with spatial disparity and signal duration.
influence of concurrent visual streams on auditory motion
perception and vice versa. In Fig. 2 localization performance at
different motion offset locations in space is plotted as a
function of spatial disparity between the attended and
unattended modality. For reasons of clarity, the respective
temporal disparities are not indicated in the following figures.
Note, however, that temporal offset disparities varied both
as a function of spatial disparity and overall signal duration
(Fig. 1C, Table 1). Mean values are displayed for normalized
data, i.e. relative to the respective unimodal localization
performance. For the statistical comparison between visual
and auditory localization performance, data of the two
stimulus durations (0.5 s, 2.0 s) were submitted to separate
four-way repeated measures analysis of variance (ANOVA),
including the within-subject factors (i) attended modality
(‘attend auditory’,‘attend visual’), (ii) spatial location of
motion offset (8°, 38°, 60°), (iii) motion direction (towards
the periphery, towards the midline) and (iv) spatio-temporal disparity (for 0.5 s signal duration: −15° (−195 ms), −10° (−130 ms), −5° (−65 ms), 0° (0 ms), +5° (+65 ms), +10° (+130 ms), +15° (+195 ms); for 2.0 s signal duration: −15° (−780 ms), −10° (−520 ms), −5° (−260 ms), 0° (0 ms), +5° (+260 ms), +10° (+520 ms), +15° (+780 ms)).
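The normalization used here (performance relative to the unimodal control) amounts to a simple difference of means. The sketch below uses hypothetical response data, and the sign convention (positive = shift towards the unattended offset) is an assumption for illustration.

```python
import statistics

def localization_bias(bimodal_responses_deg, unimodal_responses_deg):
    """Mean perceived final position in the bimodal condition minus the
    mean for the matching unimodal control stimulus (hypothetical data;
    positive values here mean a shift towards the unattended offset)."""
    return (statistics.mean(bimodal_responses_deg)
            - statistics.mean(unimodal_responses_deg))

# Attended acoustic offset at +8 deg, concurrent visual offset displaced
# by +10 deg (hypothetical perceived positions, in degrees):
bimodal = [10.5, 11.0, 9.5, 12.0]
unimodal = [8.0, 8.5, 7.5, 8.0]
print(localization_bias(bimodal, unimodal))  # prints 2.75
```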
2.1.1. Signal duration 0.5 s
Localization performance in both the ‘attend auditory’ and the ‘attend visual’ condition was affected by concurrent
motion streams in the respective unattended modality
(Fig. 2A). Auditory motion perception mainly varied as a
function of the displaced visual stimulus at central motion
offset locations. The mean values indicate that the magnitude
of localization bias towards the respective final position of
visual trajectories increased with increasing spatio-temporal
disparities. This trend was confirmed by a highly significant main effect of spatio-temporal disparity (F(6,150) = 25.55; p < 0.001, η² = 0.505). The localization biases were noticeably reduced, but still significant, when trajectories with the same disparities between the visual and the acoustic signal terminated at paracentral (F(6,150) = 9.63; p < 0.001, η² = 0.278) and lateral motion offset locations (F(6,150) = 2.51; p = 0.044, η² = 0.091). Opposite effects were observed for the ‘attend visual’ condition. Concurrent acoustic streams had the strongest influence on visual motion perception at lateral motion offset locations (F(6,150) = 53.12; p < 0.001, η² = 0.680). That is, the perceived final positions of visual trajectories were shifted towards the respective final positions of acoustic trajectories. This biasing effect was reduced, though still significant at paracentral (F(6,150) = 29.58;
Fig. 2 – Mean localization bias in the attended modality. Blue symbols: ‘attend auditory’; yellow symbols: ‘attend visual’. Localization performance is given for central, paracentral and lateral motion offset locations (data are collapsed across both hemifields). The respective trajectories are indicated on top of the graph. Red circles denote motion offset locations in the attended modality. Normalized data are plotted as a function of spatial disparity between the acoustic and visual motion offset. Note that temporal disparities are not indicated in the graph. The dotted grey lines refer to the performance in localizing the respective unimodal control stimulus. Mean values deviating from zero indicate a localization bias towards the non-attended modality. Lower and upper error bars represent the 25th and the 75th percentile, respectively. (A) 0.5 s signal duration in the attended modality. (B) 2.0 s signal duration in the attended modality.
p < 0.001, η² = 0.542) and central motion offset locations (F(6,150) = 7.76; p < 0.001, η² = 0.237).
The comparison of visual and auditory localization performance in a four-way repeated measures ANOVA revealed a main effect of spatio-temporal disparity (F(6,150) = 82.99; p < 0.001, η² = 0.768). Localization biases in the attended modality were correlated with spatio-temporal disparities, i.e. the largest negative shifts were observed for disparities of −15° (−195 ms), whereas the largest positive shifts occurred for disparities of +15° (+195 ms). The overall biasing effect of concurrent acoustic streams on visual motion perception was larger than the reverse effect for disparities of −10° (−130 ms), +10° (+130 ms) and +15° (+195 ms), which is confirmed by an interaction of attended modality and spatio-temporal disparity (F(6,150) = 2.70; p = 0.038, η² = 0.097). The inverted pattern of mutual influences between modalities at central and lateral motion offset locations is reflected in the significant interaction of attended modality, spatial location of motion offset and spatio-temporal disparity (F(12,300) = 8.31; p < 0.001, η² = 0.249). When subjects indicated the
final position of moving acoustic signals that terminated at
central locations, the magnitude of localization bias varied as a
function of the displaced visual motion offset, whereas the
localization of visual motion offsets in the central field was much
less affected by the displaced acoustic motion offset. However, at
lateral motion offset locations, it was visual motion perception
that was biased in the presence of concurrent acoustic streams
and the magnitude of the respective localization bias depended
on the displaced acoustic motion offset. Concurrent visual
streams, in contrast, were not as effective in biasing the perceived
final position of acoustic trajectories in the periphery. Finally, at
paracentral motion offset locations, comparable biasing effects
towards the unattended modality were observed between the
‘attend auditory’ and ‘attend visual’ conditions.
2.1.2. Signal duration 2.0 s
For longer signal durations, localization performance in the
target modality was less affected or even unaffected by
concurrent motion streams in the unattended modality
(Fig. 2B). Specifically, mean values in the ‘attend visual’
condition were largely comparable to unimodal localization
performance at each spatial location of motion offset. This
indicates that concurrent acoustic motion streams hardly
affected the perceived final position of visual motion. Biasing
effects were observed in the ‘attend auditory’ condition at paracentral (F(6,150) = 5.73; p = 0.001, η² = 0.186) and lateral motion offset locations (F(6,150) = 6.55; p < 0.001, η² = 0.208). Surprisingly, the localization biases were towards the opposite direction (i.e. away from the final position of visual motion).
The comparison of visual and auditory localization performance in a four-way repeated measures ANOVA revealed a main effect of attended modality (F(1,25) = 5.75; p = 0.024, η² = 0.187), which was due to the biasing effects in the ‘attend auditory’ condition. More pronounced biasing effects were observed for signals moving towards the periphery (F(1,25) = 10.19; p = 0.004, η² = 0.290). Furthermore, the magnitude of localization bias depended on the spatio-temporal disparity (F(6,150) = 12.72; p < 0.001, η² = 0.337), i.e. localization of the acoustic signal was increasingly biased away from the location of the visual motion offset with increasing spatio-temporal disparities. The interaction of attended modality and spatio-temporal disparity (F(6,150) = 7.25; p < 0.001, η² = 0.225) is explained by the fact that the magnitude of the localization bias towards the opposite direction varied as a function of spatio-temporal disparity, particularly in the ‘attend auditory’ condition. Additionally, the significant interaction of spatial location of motion offset and spatio-temporal disparity (F(12,300) = 3.15; p = 0.004, η² = 0.112) is due to the fact that localization performance varied as a function of spatio-temporal disparity at paracentral and lateral motion offset locations, but was not affected at central motion offset locations.
2.1.3. Comparing localization performance for 0.5 s and 2.0 s
signals
The employed spatial disparities at motion offset were
identical for short and long signal durations. However, the
presentation of the same spatial offset disparity in signals
that differ in their overall duration implicated a different
temporal offset disparity between the attended and unat-
tended signal, a feature inherent to the variation in the
duration of moving signals. This temporal variation led to
notable differences in localization performance between both
signal durations. Effects of unattended motion streams on
localization performance in the attended modality were
predominantly observed for signals at a duration of 0.5 s, in
which spatial disparities in steps of 5° between −15° and +15°
co-varied with temporal disparities in steps of 65 ms between
−195 ms and + 195 ms. However, the covariation of spatial and
temporal disparities differed in signals at a duration of 2.0 s.
Spatial disparities between −15° and +15° were accompanied
by larger temporal disparities in steps of 260 ms between
−780 ms and + 780 ms, resulting in a considerably reduced
influenceof the unattended signal on localization performance
in the target modality.
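This co-variation follows directly from the stimulus geometry: both signals move at the same angular speed, and the attended signal covers 38° within the signal duration, so each spatial offset disparity maps onto a proportional temporal one. A minimal sketch of that mapping (the paper's reported steps of roughly 65 ms and 260 ms per 5° closely match this proportionality, suggesting the published values were rounded to convenient steps):

```python
def temporal_disparity_ms(spatial_disparity_deg, signal_duration_s,
                          angular_range_deg=38.0):
    """Temporal offset disparity implied by a spatial offset disparity
    when both signals travel at the same angular speed (the attended
    signal covers angular_range_deg within signal_duration_s)."""
    angular_speed = angular_range_deg / signal_duration_s  # deg/s
    return spatial_disparity_deg / angular_speed * 1000.0  # ms

# Quadrupling the signal duration quarters the speed and therefore
# quadruples the temporal disparity implied by the same spatial offset.
for duration in (0.5, 2.0):
    steps = [round(temporal_disparity_ms(d, duration)) for d in (5, 10, 15)]
    print(duration, "s:", steps)
```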
2.2. Experiment 2: integration and separation of audio-visual motion streams
Reports of perceived unity or separation of moving audio-
visual stimuli (Fig. 3) were classified into four groups,
according to the respective constant signal (acoustic or visual)
and stimulus duration (0.5 s or 2.0 s). Statistical analysis was
based on separate log-linear models for the data of the two
stimulus durations. The included variables were the same as
in experiment 1 (see Section 2.1).
2.2.1. Signal duration 0.5 s
Subjects predominantly reported perceived unity for congru-
ent audio-visual signals or for signals with spatio-temporal
disparities of −/+5° (−/+ 65 ms) between the acoustic and the
visual motion offset (Fig. 3A).
With increasing spatio-temporal disparities, the probability to integrate both signals constantly decreased, as reflected in a one-way association of spatio-temporal disparity (χ²(0.95,6) = 101.7; p < 0.001). A two-way association of spatio-temporal disparity and constant modality (χ²(0.95,6) = 60.90; p < 0.001) is explained by an increased probability to separate signals when the visual trajectory exceeded the acoustic trajectory in terms of motion path and duration. That is, for the constant acoustic trajectory, the percentage of reported unity, in particular for spatial disparities of +10° (+130 ms) and +15°
(+195 ms) (visual trajectory longer than the acoustic trajectory),
was considerably smaller compared to spatio-temporal dispar-
ities of −10° (−130 ms) and −15° (−195 ms) (visual trajectory
shorter than the acoustic trajectory). Likewise, when the visual
trajectory was constant, the probability to integrate both signals
was lower for spatio-temporal disparities of −10° (−130 ms) and
−15° (−195 ms) (visual trajectory longer than the acoustic
trajectory) compared to spatio-temporal disparities of +10°
(+130 ms) and +15° (+ 195 ms) (visual trajectory shorter than
the acoustic trajectory). No one-way associations or higher-
order associations were found for the factors spatial location of
motion offset or motion direction. This indicates that the ratios
of integration and separation are similar for stimulus combina-
tions with identical spatio-temporal disparities, but traveling
along different trajectories.
2.2.2. Signal duration 2.0 s
Reports of perceived unity were mainly observed for congru-
ent trials. A lower percentage of audio-visual signals was
integrated at spatio-temporal disparities of −/+5° (−/+260 ms)
and signals were almost exclusively perceived as separated
for spatio-temporal disparities of −/+10° (−/+520 ms) and −/+
15° (−/+780 ms) (Fig. 3B). A one-way association of spatio-temporal disparity (χ²(0.95,6) = 858.1; p < 0.001) confirms the
increased probability of stimulus separation with increasing
disparities between both streams. Furthermore, the percent-
age of reported integration was lower when the moving visual
signal's termination point exceeded its acoustic counterpart.
For example, for trials in which the acoustic trajectory was
constant, the percentage of reported signal unity was about
70% when the visual trajectory was shorter than the acoustic
trajectory (−5°, −260 ms), but considerably lower (about 20%)
when the visual trajectory exceeded the acoustic trajectory by
+5° (+260 ms). Response behavior was similar for trials in
which the visual trajectory was constant. This finding was
confirmed by a two-way association of spatio-temporal
disparity and constant modality (χ²(0.95,6) = 100.19; p < 0.001).
Again, there were no one-way associations or higher-order
associations including the factors spatial location of motion
offset or motion direction, indicating that stimulus combina-
tions with identical spatio-temporal disparities were equally
likely to be integrated regardless of the traveled trajectory.
2.2.3. Comparing discordance detection for 0.5 s and 2.0 s
signals
The rate of perceived unity showed a pronounced asymmetry
between short (Fig. 3A) and long signal durations (Fig. 3B). As
described for experiment 1, one has to take into account that
temporal disparities at motion offset covary both with spatial disparity and overall signal duration, which notably influenced performance in detecting incongruent trials. The same spatial disparities ranging between −15° and +15° were accompanied by much larger temporal disparities in 2.0 s signals, which is reflected in overall lower probabilities to integrate both signals.
[Fig. 3 shows panels (A) 0.5 seconds and (B) 2.0 seconds, each with ‘acoustic constant’ and ‘visual constant’ conditions; ordinate: percentage reports of unity (0–100%); abscissa: spatial disparity (−15° to +15°); legend trajectories: 47° to 8°, 30° to 8°, 77° to 38°, 0° to 38°, 90° to 60°, 22° to 60°.]
Fig. 3 – Percentage of reports of perceived unity in the constant modality. Percentages are given for each trajectory and each spatial disparity between the acoustic and visual motion offset (data are collapsed across both hemifields). Note that temporal disparities are not indicated in the graph. Different colors indicate different trajectories as illustrated on top of the graph and specified in the legend. Black circles denote motion offset locations in the constant modality. (A) 0.5 s signal duration in the constant modality. (B) 2.0 s signal duration in the constant modality.
Still, there were two general principles that applied to both
signal durations: when the visual trajectory exceeded the
acoustic trajectory, the probability of perceiving separate
signals was increased compared to signals in which the
acoustic trajectory exceeded the visual trajectory. Furthermore,
the proportion of perceived unity and separation for identical
spatio-temporal disparities was not affected by the traveled
trajectory of the audio-visual stimulus.
3. Discussion
The goal of the current experiments was to disentangle the
contributions of crossmodal interactions and multisensory
integration to the perception of audio-visual motion in space.
So far, spatial effects have not been investigated in detail. In
the present study, the location of the trajectories’ termination point (central vs. lateral) has proven to be a crucial factor for the crossmodal interplay between the visual and the auditory system. However, the location of the trajectories’ termination
point did not influence the ratio of perceived unity and
separation of audio-visual motion streams.
We found that, in particular for short signal durations, the
magnitude of crossmodal localization bias towards the
location of motion offset in the non-attended modality varied
as a function of spatio-temporal disparity between the visual
and acoustic motion offset. However, when the task required
a decision on whether the audio-visual stimuli were spatio-
temporally aligned or not, the percentage of reported unity
decreased with increasing spatio-temporal disparities. This
indicates that crossmodal interactions do not depend on the
integration of the audio-visual motion stream. Moreover, the
results suggest that subjects benefited jointly from spatial and
temporal cues in the perception of audio-visual motion and
that temporal disparities at motion offset likely explain the
observed asymmetries in performance between short and
long signal durations in both experiments.
3.1. Crossmodal interactions between the visual and the
auditory system depend on the trajectory in space
The findings of the audio-visual localization task indicate
mutual crossmodal interactions between the visual and the
auditory system in the perception of motion. Here, the influence
of the unattended signal on localization performance
in the attended modality crucially depended on the specific
trajectory as well as on the spatio-temporal disparity between
the visual and the acoustic motion offset. Because the stimuli
were initially congruent, the modality determining the percept
in conflicting trials could only emerge during the final part of
the moving stimulus. Crossmodal interactions between
the visual and the auditory modality were mainly observed for
short stimulus durations (0.5 s), but were found to be much
weaker for longer signal durations (2.0 s). Even though the
predefined spatial conflicts were identical for both signal
durations (−/+5°, −/+10°, −/+15°), moving audio-visual stimuli
differed with respect to temporal offset disparities (65 ms to
195 ms in short signals and 260 ms to 780 ms in long signals),
which likely explains the discrepancy in localization perfor-
mance between short and long signal durations. For 0.5 s
signals, visual motion perception was most susceptible to
influence by concurrent acoustic streams at lateral motion
offset locations, whereas auditory motion perception was
biased towards visual information mainly at central motion
offset locations. The results do not support the notion of a
general asymmetry between modalities in the perception of
motion (Soto-Faraco et al., 2004), i.e. a visual dominance, but
rather highlight the crucial role of the auditory modality when
the reliability of the visual stimulus is reduced (Alais and Burr,
2004; Hidaka et al., 2011). Our results might therefore reflect
differential abilities between the visual and the auditory system
when attempting to accurately track dynamically changing
location cues along different trajectories in space. The process-
ing of visual motion is most accurate in foveal regions, and this
accuracy in detecting positional changes decreases remarkably
in the peripheral visual field (Finlay, 1982; McKee and
Nakayama, 1984; To et al., 2011). Studies on spatial acuity in
the auditory modality have shown that the minimum auditory
movement angle (MAMA) is smallest in the central field
(Grantham, 1986). However, Perrott et al. (1987, 1990) pointed
out that the resolution of the auditory system across space is
roughly homogeneous in relation to the visual resolution. That
is, optimal visual resolution is only possible in the fovea,
whereas the acuity in perceiving sound sources is much more
constant across space. This crucial discrepancy between both
modalities is likely the key underlying feature that accounts for
the results of the present study. On the one hand, our results
show that concurrent visual motion streams ‘capture’ dynamic
auditory information at central motion offset locations, thus
confirming the results of previous studies on audio-visual
motion perception (e.g. Allen and Kolers, 1981; Mateeff et al.,
1985; Soto-Faraco et al., 2002; Strybel and Vatakis, 2004). Visual
motion perception, in contrast, was only slightly biased towards
the final position of the acoustic trajectory in the central field.
This suggests that foveal tracking of motion features in the
visual system is more precise than dynamic tracking of
interaural differences at midline positions in the auditory
system. Indeed, during dynamic visual capture, Alink et al.
(2008) demonstrated reduced activity in the auditory motion
complex (AMC) and simultaneously enhanced activity in the
human middle temporal area. The authors concluded that
altered activity in early auditory motion areas reflects a
neuronal correlate of visual dominance.
On the other hand, our results also demonstrate that
concurrent acoustic streams ‘capture’ visual motion informa-
tion at lateral motion offset locations, whereas concurrent
visual streams were not as effective in biasing auditory motion
perception in the periphery. We hypothesize that superior
peripheral tracking of dynamically changing location cues in
the auditory system accounts for the observed biasing effects
on visual motion perception. This is in line with the notion that
spatial acuity in the periphery declines more steeply in the visual
system than in the auditory system, as previously
proposed by Perrott et al. (1987, 1990). Concerning the study by
Alink et al. (2008), the question arises whether an inverted
activation shift between auditory and visual motion areas
would be observed, once the auditory modality dominates
perception. That is, enhanced activity in the respective cortical
auditory areas could directly affect motion processing in visual
areas. The hypothesisof a gradual shift in the activation pattern
is also supported by the fact that the influence of the respective
unattended modality on visual or acoustic localization perfor-
mance was equivalent at paracentral motion offset locations in
the present study. Thus, the visual and the auditory system
seem to be equal in their capabilities to track dynamic
information at intermediate spatial locations.
For 2.0 s signals, crossmodal interactions were only found
for the ‘attend-auditory’ condition. The biasing effects were
much smaller and, interestingly, negatively correlated with
spatio-temporal disparity. Wallace et al. (2004) also found a
negative localization bias for static audio-visual events that
were either spatially or temporally disparate. In the present
study, concurrent acoustic and visual signals moved in the
same direction and differed only with regard to spatial and
temporal motion offsets. Specifically, in the case of prolonged
stimulus durations, we assume that multisensory predictions
about the trajectory of an ongoing motion are established in
the respective brain areas which are then violated when one
signal ceases abruptly. Van Wanrooij et al. (2010) demonstrat-
ed that multisensory expectations based on the spatial
congruence or incongruence of audio-visual stimuli of previ-
ous trials have an influence on reaction times and accuracy in
orienting towards the current audio-visual stimulus. Further-
more, the violation of multisensory expectations has been
shown to affect the dynamics of oscillatory activity in the
brain and reflects the detection of an intermodal conflict
(Arnal et al., 2011), which in turn might interfere with our
ability to properly localize motion stimuli. In the current
study, however, the detection of a discrepancy between both
streams seems to have an impact on auditory, but not on
visual localization performance. However, it is debatable
whether the consequence of a putative prediction error can
be interpreted in terms of a crossmodal localization bias or
rather reflects a distinct phenomenon. Additional research is
needed to clarify this aspect.
3.2. Spatio-temporal analysis is crucial for the integration
and separation of audio-visual motion streams
In the second experiment we found that the performance in
detecting conflicting audio-visual trials improved with increasing
spatio-temporal disparities between the acoustic and the visual
motion offset. The results are therefore in line with previous
studies using stationary audio-visual events (Bertelson and
Radeau, 1981; Lewald and Guski, 2003; Slutsky and Recanzone,
2001; Wallace et al., 2004). As conflicting trials only differed at
motion offset, one can assume that the decision regarding the
integration or separation of both streams depended on the final
part of the audio-visual stimulus. Furthermore, we demonstrat-
ed the crucial role of the co-analysis of spatial and temporal
features in integrating multisensory information. Though the
probability of stimulus separation generally increased with
increasing spatial disparities both for 0.5 s and 2.0 s signals,
subjects’ performance showed pronounced asymmetries with
respect to the absolute rate of stimulus separation. Inherent to
the prolongation of the overall signal duration were larger
temporal offset disparities between the acoustic and visual
stream in 2.0 s signals which likely account for the observed
differences. It can be assumed that the temporal disparities in
2.0 s signals did not fall within the temporal window of
integration (Colonius and Diederich, 2010). Consequently,
these signals were more likely to be perceived as separated
compared to those at a duration of 0.5 s.
Unexpectedly, the proportion of perceived unity and
separation for stimulus combinations with identical spatio-
temporal disparities was comparable across trajectories. This
finding applied to both signal durations. It can be reasoned
that subjects may have additionally relied on a comparison of
temporal offsets between the visual and acoustic signal. Thus,
the combined use of spatial and temporal cues seems to
provide redundant information at each spatial location of
motion offset.
Another important finding was the role of the temporal
order of visual and acoustic signal offsets for perceived signal
unity. Visual motion offsets prior to acoustic motion offsets
resulted in a higher percentage of reports of perceived unity
than vice versa. In other words, segregation of both streams
occurred more frequently when the visual trajectory exceeded
the acoustic trajectory. This finding is likely the direct result
of the fact that light travels more quickly than sound (King,
2005; Recanzone, 2009), i.e. the visual information is assumed
to reach the sensory receptors earlier than the associated
acoustic information. Therefore, events violating this fact, e.g.
when sound precedes light or when the visual trajectory
exceeds the acoustic trajectory, are not considered to origi-
nate from the same object (Morein-Zamir et al., 2003) and thus
the streams tend to be separated.
3.3. Crossmodal interactions and multisensory integration
in motion perception are independent processes
Crossmodal interactions between the visual and the auditory
system were predominantly observed for short signal dura-
tions. Depending on the specific traveled trajectory in space,
both the perceived auditory and visual motion information
were subject to a ‘capture’ by the unattended modality.
Thereby, the magnitude of localization bias varied as a
function of spatio-temporal disparity between the visual and
the acoustic motion offset. Biasing effects were still observed
for the largest spatio-temporal disparities of −/+15°
(−/+195 ms). In contrast, a reverse trend was observed for the
subjects’ reports of perceived integration or separation of both
sensory streams. Increasing spatio-temporal disparities led to
more reliable separations. Thus, it can be concluded that
integration of both streams is not a necessary prerequisite for
a perceptual bias of information towards the irrelevant
modality as previously suggested by Bertelson and Radeau
(1981). The results of the present study stand in contrast to
those of Wallace et al. (2004), where subjects were presented
with congruent and conflicting stationary audio-visual
events. The subjects were instructed to localize the position
of the acoustic signal and to indicate at the same time
whether the visual and the acoustic signal were perceived as
unified or separated. Hence, Wallace et al. (2004) did not
separately investigate crossmodal interactions and multisen-
sory integration. They showed that perceived unity was
correlated with localization of the stationary auditory signal
at or very near the location of the stationary visual signal. This
is not surprising, given the fact that subjects may have relied
on the non-ignored visual signal instead of using auditory
cues for localization. However, the assumption of unity
implies integration into a coherent percept whereby different
streams of information are perceived as emanating from the
same spatial location. Localization in a selective attention
task, in contrast, rather reflects crossmodal interactions
between senses without necessarily assuming that various
streams belong to the same object. In the present study,
localization biases towards the non-attended signal were
observed at intermediate positions between the distance of
the visual and the acoustic motion offset location. Thus, the
results support the existence of genuine localization biases
and do not reflect pure decisional strategies.
Furthermore, our data provide clear evidence for a co-
analysis of spatial and temporal cues in the perception of
bimodal motion streams. Though one cannot dissociate the
respective contribution of spatial and temporal cues to
behavioral performance for a given spatio-temporal disparity,
the comparison between short and long signal durations
revealed that multisensory interactions only occur when both
cues are provided within a specific range of tolerance. Both
interaction and integration effects were much weaker for 2.0 s
signals. Although these signals were presented with identical
predefined spatial disparities as those at a duration of 0.5 s,
temporal disparities were much larger and obviously exceeded
the temporal window for binding of various streams of informa-
tion. Temporal disparities are therefore likely to explain the
observed asymmetries both in the magnitude of localization bias
and the rate of perceived unity between 0.5 s and 2.0 s signals.
Previously, audio-visual interactions in motion perception
have mainly been studied in tasks in which subjects were
asked to discriminate the direction of motion in one modality
while ignoring a crossmodal distractor in another modality
(Allen and Kolers, 1981; Mateeff et al., 1985; Soto-Faraco et al.,
2002; Strybel and Vatakis, 2004). In the present study, cross-
modal interactions and multisensory integration were studied
with concurrently presented acoustic and visual signals that
moved in identical directions but were conflicting at motion
offset. The results support the notion that the visual modality
dominates motion perception in the central field which is
most likely due to superior tracking of positional information
at the midline (Soto-Faraco et al., 2004). However, tracking of
dynamically changing location cues in the periphery seems to
be more accurate in the auditory system as concurrent
acoustic motion streams biased visual motion perception at
lateral motion offset locations. The results of the current
study confirm that degraded reliability of the visual signal in
motion perception can be compensated for by the auditory
modality as previously proposed by Hidaka et al. (2011).
Taken together, the present findings indicate that different
trajectories in space alter the perceived quality of moving
visual and moving acoustic signals and thus their suscepti-
bility to capture by the other modality. The crossmodal
localization bias towards the location of the final position in
the non-attended modality still occurs even when the moving
audio-visual signal would be perceived as separated. This
finding implicates a more restricted range of tolerance for
spatial and/or temporal disparities in multisensory integra-
tion compared to crossmodal interactions, an aspect that
needs to be further considered in future studies investigating
multisensory interactions.
4. Conclusion
Our data provide evidence that the interplay of the visual and
the auditory modality in motion perception crucially depends
on the trajectory. The findings indicate that positional
information in the central field is more accurately tracked by
the visual system, since concurrent visual streams biased
auditory motion perception mainly at central motion offset
locations. The auditory system, in contrast, seems to be
superior to the visual system in tracking positional informa-
tion in the peripheral space as visual localization performance
was biased towards the final position of the acoustic
trajectory mainly at lateral motion offset locations. The
magnitude of localization bias thereby varied as a function
of the spatio-temporal disparity between the visual and the
acoustic stream. Importantly, the interplay between modali-
ties was only observed when temporal conflicts at motion
offset did not exceed a critical range. The results furthermore
suggest that crossmodal interactions occur independently
from the integration of the audio-visual motion stream.
5. Experimental procedure
5.1. Subjects
Twenty-six subjects (14 females, 12 males; 4 left-handed;
mean age: 26.4 years; age range 20–33 years) with normal or
corrected-to-normal vision and normal hearing abilities
participated in both experiment 1 and experiment 2. None of
the subjects reported any neurological disorder. All subjects
gave informed written consent and were compensated for
their participation. This study conformed to The Code of
Ethics of the World Medical Association and was approved by
the local Ethics Committee of the University of Leipzig.
5.2. Setup and stimuli
The experiments were conducted in an anechoic, sound
attenuated free-field laboratory (40 m², Industrial Acoustics
Company [IAC]). Forty-seven broad-band loudspeakers (Visaton,
FRS8 4) were mounted in an azimuthal, semicircular array
at ear level (Fig. 1A). A comfortable, fixed chair was positioned
in the middle of the semicircle at a constant distance of 2.35 m
from the loudspeakers such that subjects were aligned
straight ahead to the central speaker at 0°. The loudspeaker
array covered an azimuthal plane from −98° to the left to +98°
to the right. The angular distance between two loudspeaker
membranes was 4.3°. Each loudspeaker was calibrated indi-
vidually. For this, the transmission spectrum was measured
using the Brüel & Kjær measuring amplifier (B&K 2610), a
microphone (B&K 2669, pre-amplifier B&K 4190) and a real-
time signal processor (RP 2.1, System3, Tucker Davis Technol-
ogies, TDT). For each loudspeaker a calibration file was
generated in Matlab 6.1 (The MathWorks Inc, Natick, USA)
and later used for presentation of acoustic stimuli with flat
spectra across the frequency range of the stimulus.
The speaker array was combined with an array of 188
white light emitting diodes (LED) mounted in azimuthal steps
of 1° at eye-level. The LEDs were controlled by a set of 51
printed circuit boards (PCB), which were interfaced with a
desktop PC. Each PCB was assembled with four infra-red (IR)
sensitive phototransistors for the registration of pointing
directions. The phototransistors were arranged with the
same angular distances as the LEDs, but extended beyond
the loudspeaker- and LED array by 8° to both the left and the
right. In combination with the IR-sensitive phototransistors,
the LED array was also used to provide visual feedback of the
angular position pointed to by the subjects. A customized IR-
torch served as pointing device (Solarforce L2 with 3W NVG
LED). The subtended angle of the IR-light beam covered a
maximum of 8° at the level of the LEDs. The mean position
across all activated IR-sensitive phototransistors was com-
puted online and the corresponding LED flashed up as a visual
feedback for the subject.
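The registration logic described above can be sketched as follows. This is a hypothetical Python illustration (the actual system ran on custom PCB hardware under MATLAB control); the function name and the degree-indexed sensor scheme are assumptions.

```python
import numpy as np

def feedback_led(activated_sensors):
    """Given the indices (in 1-degree steps) of all IR-sensitive
    phototransistors currently illuminated by the torch beam, return
    the index of the LED to flash as positional feedback."""
    if len(activated_sensors) == 0:
        return None  # torch beam is off the array
    # mean position across all activated sensors, snapped to the 1° LED grid
    return int(round(float(np.mean(activated_sensors))))
```

A beam covering several adjacent sensors thus always maps onto the single LED closest to its center, which is what makes the torch usable as a continuous pointer despite the discrete sensor grid.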
The loudspeakers and LEDs were hidden behind acousti-
cally transparent gauze, which did not affect visibility of the
LEDs. Thus, subjects were unable to make use of landmarks
during the localization and detection tasks. An infra-red
camera was installed in the test chamber to monitor subjects’
performance during the experimental sessions. Custom-
written MATLAB scripts (R2007b, The MathWorks Inc., Natick,
USA) were used to control stimulus presentation and data
acquisition. Visual and acoustic signals were digitally gener-
ated using RPvdsEx (Real Time Processor Visual Design Studio,
Tucker Davis Technologies, TDT) and delivered to two multi-
channel signal processors (RX8, System3, TDT).
Acoustic stimuli were low-frequency Gaussian noise bursts
(250–1000 Hz) that were presented at 40 dB SL (sensation level).
Sound localization in this low-frequency range is primarily
based on the processing of interaural time differences (ITDs).
Sound motion was simulated by successive activation of
adjacent loudspeakers. To obtain a continuous motion, the
ratio of sound intensity between two adjacent loudspeakers
was adjusted by linear cross-fading of the output voltage. The
level roving (variability in sound intensity around presentation
level) was set to −/+3 dB to avoid adaptation to loudness-related
localization cues. Visual stimuli were light spots at a luminance
of 2.5 lux. Moving visual signals were simulated by successive
activation of adjacent LEDs. The small distance of 1.0° between
two adjacent LEDs was sufficient to generate an apparent
motion percept.
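The cross-fading scheme for simulated sound motion can be sketched as follows. This Python fragment is an illustrative reconstruction, not the authors' implementation (which ran on TDT signal processors); the function name and the clamping behavior at the array edges are assumptions.

```python
import numpy as np

SPEAKER_SPACING = 4.3  # degrees between adjacent loudspeaker membranes (from the text)

def crossfade_gains(azimuth, speaker_positions):
    """Voltage gains for all loudspeakers for a virtual sound source at
    `azimuth`. Only the two speakers bracketing the source are active;
    their gains fade linearly with the fractional position between them,
    so the source glides continuously as `azimuth` is ramped."""
    speaker_positions = np.asarray(speaker_positions, dtype=float)
    gains = np.zeros(len(speaker_positions))
    # index of the last speaker at or before the source (clamped to the array)
    i = int(np.searchsorted(speaker_positions, azimuth, side="right")) - 1
    i = max(0, min(i, len(speaker_positions) - 2))
    frac = (azimuth - speaker_positions[i]) / (
        speaker_positions[i + 1] - speaker_positions[i])
    gains[i] = 1.0 - frac
    gains[i + 1] = frac
    return gains

# e.g. a source halfway between two speakers gets equal gains:
# crossfade_gains(2.15, np.arange(0.0, 13.0, SPEAKER_SPACING))
```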
Stimuli were unimodal (acoustic only/visual only) and
concurrent audio-visual motion streams. The constant signal
in moving audio-visual stimuli (attended modality in exper-
iment 1 or constant modality in experiment 2) traveled an
angular range of 38°. The speed of motion, however, varied
with the signal duration of the presented sequence (either
0.5 s or 2.0 s, including 10 ms rise and decay times). The final
positions of the constant signals were located at −8°, −38°,
−60°, +8°, +38°, and +60°, respectively (Fig. 1A). Motion offset
locations at −/+8° were defined as central, at −/+38° as
paracentral and at −/+60° as lateral. The concurrent signal in
the second modality finished either at a congruent location or
spatially and temporally displaced with respect to the final
position of the constant signal (Fig. 1B). Note that the
predefined spatial disparities (−/+5°, −/+10°, −/+15°) for 0.5 s
and 2.0 s signals were identical, but temporal disparities co-
varied both with spatial disparity and signal duration (Fig. 1C,
Table 1). Control stimuli (acoustic only or visual only) were
identical to the respective constant signal of audio-visual
signals in terms of trajectory and signal duration. To examine
the possible effect of motion direction on localization perfor-
mance, unimodal and audio-visual stimuli moved towards
the final position from either side (i.e. towards the midline
and towards the periphery, Fig. 1A).
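Because the constant signal always travels 38° within the fixed signal duration, each predefined spatial disparity implies a temporal disparity at motion offset. A minimal sketch of this co-variation, assuming constant speed along the trajectory; the computed values agree with Table 1 to within about 10 ms (the published values presumably reflect the discrete speaker and LED positions actually used).

```python
# Temporal offset disparity implied by a spatial offset disparity,
# given a constant-modality signal traveling 38 degrees in 0.5 s or
# 2.0 s (cf. Table 1). Approximate reconstruction, not the authors' code.

TRAJECTORY_DEG = 38.0

def temporal_disparity(spatial_disparity_deg, duration_s):
    speed = TRAJECTORY_DEG / duration_s       # degrees per second
    return spatial_disparity_deg / speed      # seconds

# e.g. temporal_disparity(5, 0.5) is ~0.066 s (table: 0.065 s),
# temporal_disparity(15, 2.0) is ~0.79 s (table: 0.78 s)
```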
5.3. Study design and procedure
Subjects were tested in complete darkness. Prior to experimen-
tal testing the detection threshold for moving sounds was
obtained for each subject to adjust the presentation level for
acoustic signals during the tests to 40 dB SL. Employing a heard/
not-heard paradigm, the subjects were asked to indicate by a
button press on a response box whether they detected an
acoustic signal (Gaussian noise bursts, 250–1000 Hz) moving
from −38° to 0° in the left hemifield. The initial sound level of
the moving stimulus was set to 63 dB SPL. When the subjects
detected the stimulus, sound level was decreased in steps of
2.5 dB. Otherwise, sound level was increased in equal step sizes.
When the subjects were confident about their individual hearing
threshold, i.e. the minimum sound pressure level which they
required to detect the moving stimulus, they confirmed their
decision by a button press. This reference value was used to set
the acoustic stimulus at 40 dB SL.
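The level-adjustment procedure can be sketched as follows; the simulated heard/not-heard responses stand in for the subject's button presses and are invented for illustration.

```python
# Sketch of the heard/not-heard level-adjustment procedure described
# above. The response sequence below is hypothetical example data.

START_LEVEL_DB = 63.0       # initial sound pressure level (dB SPL)
STEP_DB = 2.5               # fixed step size
SENSATION_LEVEL_DB = 40.0   # test stimuli presented at threshold + 40 dB

def track_threshold(responses):
    """Decrease the level after each 'heard' response and increase it
    after each 'not heard'; the level reached when the subject confirms
    is taken as the individual detection threshold."""
    level = START_LEVEL_DB
    for heard in responses:
        level += -STEP_DB if heard else STEP_DB
    return level

threshold = track_threshold([True, True, True, False, True])
presentation_level = threshold + SENSATION_LEVEL_DB  # level used in the experiments
```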
Table 1 – Spatial and temporal disparities at motion offset. The indications refer to disparities between the constant signal
(either a moving acoustic or a moving visual signal) and the corresponding variable signal of audio-visual stimuli as
presented in the audio-visual localization task (experiment 1) and the task on detecting spatio-temporal disparities
(experiment 2). Signal duration in the constant modality was either (A) 0.5 s or (B) 2.0 s. Note that the employed spatial
disparities were identical for both signal durations, but temporal disparities co-varied both with spatial disparity and signal
duration. In experiment 1 the indications for the constant modality and the variable modality correspond to the attended
modality and the non-attended modality, respectively.
(A) 0.5 s
Spatial disparity −15° −10° −5° 0° +5° +10° +15°
Signal duration constant modality 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s 0.5 s
Signal duration variable modality 0.305 s 0.37 s 0.435 s 0.5 s 0.565 s 0.63 s 0.695 s
Temporal disparity −0.195 s −0.13 s −0.065 s 0 s +0.065 s +0.13 s +0.195 s
(B) 2.0 s
Spatial disparity −15° −10° −5° 0° +5° +10° +15°
Signal duration constant modality 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s 2.0 s
Signal duration variable modality 1.22 s 1.48 s 1.74 s 2.0 s 2.26 s 2.52 s 2.78 s
Temporal disparity −0.78 s −0.52 s −0.26 s 0 s +0.26 s +0.52 s +0.78 s
Each subject participated in two experiments: (i) an audio-
visual localization task and (ii) a detection task in which
moving audio-visual stimuli had to be judged as spatio-
temporally congruent or incongruent.
Both experiments were divided into blocks. Short breaks
were allowed after completion of each block. Each block
started with the presentation of two stationary stimuli that
could be ignored by subjects. Five additional stationary stimuli
per block (acoustic, visual or audio-visual) were randomly
interspersed between moving stimuli to avoid adaptation to
motion. Subjects were instructed to look straight ahead
during the trials and not to pursue the moving signals with
either their eyes or their head. Unlike in a previous
free-field study in which subjects were asked to fixate an LED
during stimulus presentation (Hofbauer et al., 2004), no
fixation point was provided in the present study to exclude
the possibility that subjects could make use of it as a
reference for the midline position in the darkened test
chamber. The subjects’position was permanently monitored
by the experimenter via video stream from the test chamber.
Trials in which subjects moved their head were excluded
from data analysis.
5.3.1. Audio-visual localization task
Each subject completed a practice run consisting of 30 stimuli
to become familiar with the task and the infra-red pointing
device. During two test sessions, a total of 384 moving
stimulus combinations (336 audio-visual, 24 acoustic, 24
visual) were presented in 16 experimental blocks. An exper-
imental session consisted of 8 blocks. Each stimulus combi-
nation was presented twice. Stimuli within an experimental
block were presented in randomized order. Additionally, the
order of the 16 experimental blocks was counterbalanced
across subjects. In a blocked design, subjects were instructed
to selectively focus their attention on either the acoustic
(‘attend auditory’) or the visual (‘attend visual’) component of
the moving audio-visual stimuli. Instruction on the respective
to-be-attended-modality was given prior to each test block.
Moving unimodal stimuli within an experimental block were
only presented in the modality that corresponded to the
attended modality. Subjects were asked to indicate the
perceived final position of the moving targets in the attended
modality by pointing with an IR-torch. Visual feedback on the
indicated angular position was provided by flashing up the
corresponding LED. To confirm their response, subjects had to
release the button on the IR-torch whereby the designated
LED flashed three times and signaled successful registration
of the angular position. Responses were automatically stored
for subsequent data analyses. The next trial started after an
intertrial-interval (ITI) of 3.0 s, such that subjects were able to
re-orient towards the midline position. A response to each
moving target was required before the next trial could begin.
No feedback on the correct angular position was given at any
time of the test session. Fig. 4 illustrates the experimental
procedure.
5.3.2. Integration and separation of audio-visual motion streams
Each subject completed a practice run consisting of 20 stimuli to
become familiar with the task. Only audio-visual stimulus
combinations (the same as in the audio-visual localization task)
were presented in randomized order in eight experimental
blocks. The order of the eight blocks was counterbalanced across
subjects. Subjects were asked to simultaneously attend to the
acoustic and the visual part of the moving stimulus and to judge
whether motion offsets were spatio-temporally aligned or not
(same-different judgments). After each trial subjects indicated
their response by pressing the corresponding button on a
response box. The ITI between button press and the onset of
the next trial was 2.0 s. A response to each moving audio-visual
stimulus was required before the next trial could begin. No
feedback on the correct response was given at any time of the
test session.
5.4. Data analysis
Subjects’ performance did not differ between the two hemifields,
so data for each experiment were collapsed for the respective
trajectories in the left and right hemifield.
5.4.1. Audio-visual localization task
Localization bias in the attended modality was quantified as the
difference between the indicated final position and the actual
final position of motion for each stimulus combination. Addi-
tionally, data were normalized with respect to individual unim-
odal localization performance. Normalization compensated for
the overestimation of the final position which is inherent
to motion perception (Hubbard, 2005). Displaying normal-
ized data allowed for the direct comparison of the
magnitude of localization bias between different trajecto-
ries. Stimulus repetitions were averaged for each subject.
Fig. 4 –Experimental procedure. The symbols illustrate a sequence of different types of moving and stationary stimuli that
were presented in randomized order within an experimental block. The stimulus durations of signals in the attended modality
(audio-visual localization task) and in the constant modality (detection task on spatio-temporal disparities) are indicated below
the symbols. R = response (experiment 1: localization of the final position of motion in the attended modality using an infra-red
torch, experiment 2: button press in the task on detecting spatio-temporal disparities), ITI = intertrial-interval.
The results for the respective stimulus combinations were
grouped according to both the attended modality and the
stimulus duration. Statistical analyses were based on
normalized data. Data were submitted to a four-way
repeated measures ANOVA. The included within-subject
factors are specified in Section 2.1. Data of the two
stimulus durations (0.5 s, 2.0 s) were analyzed in separate
ANOVAs.
5.4.2. Integration and separation of audio-visual motion streams
Response behavior was evaluated with regard to the rate of
perceived unity for each stimulus combination. The results
for the respective stimulus combinations were grouped
according to both the constant modality and the stimulus
duration. Data (same/different responses) were nominal vari-
ables and were submitted to a log-linear analysis for
statistical analysis. Log-linear models predict the expected
frequency counts in a contingency table for a two- or more
factorial design. Differences between observed frequencies
and expected frequencies are expressed in one-way associa-
tions (main effects) and two-way and higher-order associa-
tions (interactions). The log-linear data analysis was based on
the number of subjects that reported perceived unity for a
given stimulus combination. Included variables were the
same as in the audio-visual localization task (see Section 2.1).
Separate analyses were conducted for the data of the two
stimulus durations (0.5 s and 2.0 s).
Acknowledgments
This work was supported by the Deutsche Forschungsge-
meinschaft (DFG), graduate program ‘Function of attention in
cognition’. We wish to thank two reviewers for their valuable comments and suggestions. We are also grateful to Ingo Kannetzky, Jörg Eckebrecht and Matthias Freier for technical assistance and to Patrice Voss for proofreading the manuscript.