Simulation of talking faces in the human brain
improves auditory speech recognition
Katharina von Kriegstein, Özgür Dogan, Martina Grüter, Anne-Lise Giraud, Christian A. Kell, Thomas Grüter, Andreas Kleinschmidt, and Stefan J. Kiebel

Author affiliations: Wellcome Trust Centre for Neuroimaging, University College London, Queen Square, London WC1N 3BG, United Kingdom; Medical School, University of Newcastle, Framlington Place, Newcastle-upon-Tyne NE2 4HH, United Kingdom; Department of Neurology, J. W. Goethe University, Schleusenweg, 60528 Frankfurt am Main, Germany; Department of Psychological Basic Research, University of Vienna, Liebiggasse, 1010 Vienna, Austria; Département d'Études Cognitives, École Normale Supérieure, 75005 Paris, France; CEA, NeuroSpin, 91401 Gif-sur-Yvette, France; and Institut National de la Santé et de la Recherche Médicale, U562, 91401 Gif-sur-Yvette, France
Edited by Dale Purves, Duke University Medical Center, Durham, NC, and approved March 15, 2008 (received for review November 15, 2007)
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face.
fMRI | multisensory | predictive coding | prosopagnosia
Human face-to-face communication works best when one can watch the speaker's face (1). This becomes obvious when someone speaks to us in a noisy environment, in which the auditory speech signal is degraded. Visual cues place constraints on what our brain expects to perceive in the auditory channel. These visual constraints improve the recognition rate for audiovisual speech, compared with auditory speech alone (2). Similarly, speaker identity recognition by voice can be improved by concurrent visual information (3). Accordingly, audiovisual models of human voice and face perception posit that there are interactions between auditory and visual processing streams (Fig. 1A) (4, 5).
Based on prior experimental (6-8) and theoretical work (9-12), we hypothesized that, even in the absence of visual input, the brain optimizes auditory-only speech and speaker recognition by harvesting predictions and constraints from distinct visual face areas (Fig. 1B).

Experimental studies (6, 8) demonstrated that the identification of a speaker by voice is improved after a brief audiovisual experience with that speaker (in contrast to a matched control condition). The improvement effect was paralleled by an interaction of voice- and face-identity-sensitive areas (8). This finding suggested that the associative representation of a particular face facilitates the recognition of that person by voice. However, it is unclear whether this effect also extends to other audiovisual dependencies in human communication. Such a finding, for example in the case of speech recognition, would indicate that the brain routinely fills in missing information to make auditory communication more robust.
To test this hypothesis, we asked the following question: What does the brain do when we listen to someone whom we have previously seen talking? Classical speech processing models (the "auditory-only" model) predict that the brain uses auditory-only processing capabilities to recognize speech and speaker (13, 14). Under the "audiovisual" model, we posit that the brain uses previously learned audiovisual speaker-specific information to improve recognition of both speech and speaker (Fig. 1B). Even without visual input, face-processing areas could use encoded knowledge about the visual orofacial kinetics of talking and simulate a specific speaker to make predictions about the trajectory of what is heard. This visual online simulation would place useful constraints on auditory perception to improve speech recognition by resolving auditory ambiguities. This constructivist view of perception has proved useful in understanding human vision (15, 16) and may be even more powerful in the context of integration of prior multimodal information. To identify such a mechanism in speech perception would not only have immediate implications for the ecological validity of auditory-only models of speech perception but would also point to a general principle of how the brain copes with noisy and missing information in human communication.
Speech and speaker recognition largely rest on two different sets of audiovisual correlations. Speech recognition is based predominantly on fast time-varying acoustic cues produced by the varying vocal tract shape, i.e., orofacial movements (17, 18). Conversely, speaker recognition uses predominantly time-invariant properties of the speech signal, such as the acoustic properties of the vocal tract length (19). If the brain uses stored visual information for processing auditory-only speech, the relative improvement in speech and speaker recognition could, therefore, be behaviorally and neuroanatomically dissociable. To investigate this potential dissociation, we recruited prosopagnosics, who have impaired perception of face identity but seem to have intact perception of orofacial movements (20).
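To make this distinction concrete, the following is a minimal editorial sketch, not part of the original study, of how time-varying and time-constant properties of a speech recording could be separated with a short-time Fourier transform. The file name "speech.wav" and the choice of proxies (frame-to-frame spectral change for fast articulatory cues, the long-term average spectrum for speaker-specific vocal-tract characteristics) are illustrative assumptions.

# Minimal sketch (illustrative only): separating time-varying from
# time-constant properties of a speech signal with scipy.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

fs, x = wavfile.read("speech.wav")          # hypothetical mono recording
x = x.astype(np.float64)
x /= np.max(np.abs(x)) + 1e-12              # normalize amplitude

# Short-time spectrogram: rows = frequencies, columns = time frames.
f, t, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
power = np.abs(Z) ** 2

# Time-varying cues (speech content): how quickly the spectral envelope
# changes from frame to frame, driven by orofacial/vocal-tract movements.
log_spec = np.log(power + 1e-12)
spectral_flux = np.mean(np.abs(np.diff(log_spec, axis=1)), axis=0)

# Time-constant cues (speaker identity): the long-term average spectrum,
# which reflects relatively stable properties such as vocal-tract length.
long_term_avg_spectrum = power.mean(axis=1)

print("mean spectral flux (content-related):", spectral_flux.mean())
print("spectral centroid of LTAS (speaker-related, Hz):",
      float(np.sum(f * long_term_avg_spectrum) / np.sum(long_term_avg_spectrum)))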
Neurophysiological face processing studies indicate that distinct brain areas are specialized for processing time-varying information [facial movements, superior temporal sulcus (STS)
Author contributions: K.v.K. designed research; K.v.K., Ö.D., M.G., A.-L.G., C.A.K., T.G., and A.K. performed research; S.J.K. contributed new reagents/analytic tools; K.v.K. and Ö.D. analyzed data; and K.v.K. and S.J.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
To whom correspondence should be addressed. E-mail: kkriegs@fil.ion.ucl.ac.uk.
This article contains supporting information online at www.pnas.org/cgi/content/full/
0710826105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA
(21, 22)] and time-constant information [face identity, fusiform face area (FFA) (23-25)] (26, 27). If speech and speaker recognition are neuroanatomically dissociable, and the improvement by audiovisual learning uses learned dependencies between audition and vision, the STS should underpin the improvement in speech recognition in both controls and prosopagnosics. A similar improvement in speaker recognition should be based on the FFA in controls but not prosopagnosics. Such a neuroanatomical dissociation would imply that visual face-processing areas are instrumental for improved auditory-only recognition. We used functional magnetic resonance imaging (fMRI) to show the response properties of these two areas.
The study consisted of (i) a "training phase," (ii) a "test phase," and (iii) a "face area localizer." In the training phase (Fig. 2A), both groups (17 controls and 17 prosopagnosics) learned to identify six male speakers by voice and name. For three speakers, the voice-name learning was supplemented by a video presentation of the moving face ("voice-face" learning), and for the other three speakers by a symbol of their occupation (no voice-face learning, which we term "voice-occupation" learning).
The test phase (Fig. 2B) was performed in the MRI-scanner. Auditory-only sentences from the previously learned speakers were presented in 29-s blocks. These sentences had not been used during the training phase. Before each block, participants received the visual instruction to perform either a speaker or a speech recognition task. There were four experimental conditions in total: (i) speech task: speaker learned by face; (ii) speech task: speaker learned by occupation; (iii) speaker task: speaker learned by face; and (iv) speaker task: speaker learned by occupation. For the speech tasks, subjects indicated by button press whether a visually presented word occurred during the concurrent auditory sentence. In the speaker tasks, subjects indicated whether the visually presented speaker name corresponded to the speaker of the auditory sentence. A nonspeech control condition with vehicle sounds was included in the test phase. In this condition, subjects indicated whether the visually displayed vehicle name (train, motorcycle, or racing car) corresponded to the concurrently presented vehicle sound.
After the test phase, fMRI data for the face area localizer were acquired. This included passive viewing of faces and objects and was used to localize the face-sensitive FFA and STS (see Methods).
Results
Behavior. An overview of the behavioral results is displayed in Table 1. We performed a three-way repeated-measures ANOVA with the within-subject factors "task" (speech, speaker) and "learning" (voice-face, voice-occupation) and the between-subject factor "group" (prosopagnosics, controls). There was a main effect of task [F(1,32) = 74.7, P < 0.001]; a trend toward significance for the main effect of type of learning [F(1,32) = 4.0, P = 0.053]; a type of learning × group interaction [F(1,32) = 4.8, P = 0.04]; and a three-way interaction between task, type of learning, and group [F(1,32) = 5.5, P = 0.03].
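As an illustration of the logic of this design (not the authors' SPSS analysis), the decisive three-way interaction in a 2 × 2 × 2 mixed design can be tested equivalently by computing, for each subject, the difference of the two face-benefits (speech minus speaker) and comparing it between groups. The sketch below uses hypothetical per-subject recognition-rate arrays; names such as fake_rates are placeholders.

# Minimal sketch, assuming hypothetical data: test the task x learning x group
# interaction as a between-group t-test on the difference of face-benefits.
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)

def fake_rates(n):
    """Placeholder recognition rates (%), columns:
    [speech/face, speech/occupation, speaker/face, speaker/occupation]."""
    return rng.normal(loc=[93, 92, 82, 78], scale=3, size=(n, 4))

controls = fake_rates(17)
prosopagnosics = fake_rates(17)

def face_benefits(rates):
    """Face-benefit = rate after voice-face learning minus rate after
    voice-occupation learning, separately for speech and speaker tasks."""
    return rates[:, 0] - rates[:, 1], rates[:, 2] - rates[:, 3]

c_speech, c_speaker = face_benefits(controls)
p_speech, p_speaker = face_benefits(prosopagnosics)

# Face-benefit for speech recognition across both groups (paired t-test).
t_speech, p_val_speech = ttest_rel(
    np.r_[controls[:, 0], prosopagnosics[:, 0]],
    np.r_[controls[:, 1], prosopagnosics[:, 1]])
print(f"speech face-benefit: t = {t_speech:.2f}, p = {p_val_speech:.3f}")

# Three-way interaction: difference-of-differences compared between groups
# (equivalent to the F-test for the task x learning x group interaction,
# with F = t**2 for one numerator degree of freedom).
t_int, p_int = ttest_ind(c_speech - c_speaker, p_speech - p_speaker)
print(f"three-way interaction: t = {t_int:.2f} (F = {t_int**2:.2f}), p = {p_int:.3f}")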
In both groups, prior voice-face learning improved speech recognition, compared with voice-occupation learning. In the following, we will call such improvement the "face-benefit." For both controls and prosopagnosics there was a significant face-benefit for speech recognition (paired t test: speech task/voice-face vs. speech task/voice-occupation learning: t = 2.3, df = 32, P = 0.03) [Fig. 3, Table 1, and supporting information (SI) Fig. S1]. Although face-benefits of 1.22% (controls) and 1.53% (prosopagnosics) seem small, these values are expected given that the recognition rates were >90% (28). There was no significant difference in the face-benefit between the two groups: An ANOVA for the speech task with the factors learning (voice-face, voice-occupation) and group (prosopagnosics, controls)
Fig. 1. Model for processing of human communication signals. (A) Audio-
visual input enters auditory and visual preprocessing areas. These feed into
two distinct networks, which process speech and speaker information. Mod-
ified from ref. 4. (B) Auditory-only input enters auditory preprocessing areas.
For speech recognition, facial and vocal speech areas interact while engaging
concurrently with higher levels of speech processing. Similarly, for speaker
recognition, face and voice identity areas interact with higher levels of
speaker identity processing. Note that the interactions between the boxes do
not imply direct anatomical connections and that the boxes may represent
more than one area, in particular for higher levels of speech and speaker
recognition.
Fig. 2. Experimental design. (A) Training phase. All participants learned to
identify the same six speakers. Three were learned by voice, video of moving
face, and name. Three others were learned by voice, name and a visual symbol
of an occupation. (B) Test phase (in the MRI-scanner). Participants performed
either the speech or speaker recognition task, cued visually before each
stimulus block started. The vehicle task and the face area localizer are not
displayed in this figure.
revealed no interaction [F(1,32) = 0.06, P = 0.8]. There was a main effect of type of learning [F(1,32) = 5.0, P = 0.03], which is consistent with a face-benefit in both groups for speech recognition.
In the control group, there was a significant face-benefit of 5.27% for speaker recognition (paired t test: speaker task/voice-face vs. speaker task/voice-occupation learning: t = 2.5, df = 16, P = 0.02). Critically, there was no face-benefit in the prosopagnosics for speaker recognition (t = -0.9, df = 16, P = 0.4) (Fig. 3, Table 1, and Fig. S1). An ANOVA for the speaker task revealed a significant difference in face-benefit between the controls and prosopagnosics {learning × group interaction in the speaker task [F(1,32) = 6.1, P = 0.02]}.

We also probed whether the face-benefits in speech and speaker recognition were correlated. Neither controls (Pearson: r = 0.03, P = 0.9) nor prosopagnosics (Pearson: r = -0.1, P = 0.7) showed a correlation between the two face-benefit scores. This means that a subject with, e.g., a high face-benefit in speaker recognition does not necessarily have a high face-benefit in speech recognition.
Neuroimaging. We performed two separate analyses of blood oxygenation level-dependent (BOLD) responses acquired during the test phase. First, we examined the effect of learning voice-face vs. voice-occupation associations on the responses in the face-sensitive STS and FFA (categorical analysis, Fig. 4 A and B and Fig. S2). In a second analysis, we examined the correlations between behavior and regional activation over subjects in the two face-sensitive areas (correlation analysis, Fig. 4 C-F). In both analyses, we used the face area localizer to localize the STS and FFA (see Methods).
Categorical Analysis. In both groups, activity in the face-sensitive STS was increased after voice-face learning for speech recognition (Fig. 4A). There was a significant interaction between learning (voice-face vs. voice-occupation) and task (speech vs. speaker)
Table 1. Behavioral scores for all four experimental conditions and the vehicle control condition

Task      Experimental condition    Controls           Prosopagnosics
                                    %        SE        %        SE
Speech    Voice-face                93.50    1.13      95.80    0.53
          Voice-occupation          92.28    1.16      94.27    0.76
          Face benefit               1.22    1.04       1.53    0.58
Speaker   Voice-face                82.41    3.11      78.34    2.52
          Voice-occupation          77.14    3.34      80.15    1.93
          Face benefit               5.27    2.10       1.81    1.99
Vehicle                             91.65    1.21      92.76    0.92

Recognition rates (%) are summarized as group averages with standard errors (SE). The face-benefit is defined as the recognition rate after voice-face learning minus the recognition rate after voice-occupation learning, computed separately for each task.
Fig. 3. Behavioral results. Face-benefit in the speech and speaker recognition tasks for controls (gray) and prosopagnosics (black). The face-benefit is the percentage of correct recognition after voice-face learning minus the percentage of correct recognition after voice-occupation learning. The error bars represent standard errors.
Fig. 4. fMRI results. (A and B) Difference contrasts between voice-face and voice-occupation learning in speech (A) and speaker recognition (B). (C) Statistical parametric map of positive correlations of BOLD activity with the face-benefit for speech recognition. (D) Statistical parametric map of the difference between controls and prosopagnosics in the positive correlation of BOLD activity with the face-benefit for speaker recognition. (E) Plot of the correlation between face-benefit in the speech task and STS activity. (F) Plot of the correlation between face-benefit in the speaker task and FFA activity; for controls, the correlation was significant, but not for prosopagnosics. This figure displays the results for the ROI in the left STS. See Table S5 and Fig. S4 for results for the ROI in the right STS.
[ANOVA: F(1,32) = 24, P < 0.0001]. In each group, 15 of 17 subjects showed this effect (Table S1).

The activation in the FFA was increased after voice-face learning, but only in the speaker task (Fig. 4B). There was a significant interaction between learning (voice-face vs. voice-occupation) and task (speaker vs. speech) [ANOVA: F(1,32) = 17, P < 0.0001]. Fifteen of 17 subjects in both groups showed this FFA effect (Table S2).
Correlation Analysis. For both groups, a significant positive correlation between activation and face-benefit in speech recognition was found in the left face-sensitive STS (P = 0.03, corrected; n = 34; statistical maximum at x = -56, y = -44, z = 10; Pearson: r = 0.5, P = 0.006, two-tailed, n = 34) but not in the FFA (both groups, P = 0.6). There was no difference between groups in the FFA (P = 0.5) (Fig. 4 C and E). Note that this STS region is a visual face-sensitive area and is not active during speech in general; there is no activity in this region when contrasting all conditions containing speech against the control condition with vehicle sounds. Furthermore, activity is not higher in the speech task than in the speaker task after voice-occupation learning (Fig. S3).

For controls, we found a significant positive correlation between FFA activity and the face-benefit in speaker recognition (P = 0.03, corrected; statistical maximum at x = 40, y = -42, z = -26; Pearson: r = 0.6, P = 0.012, two-tailed, n = 17). This correlation was significantly greater than in prosopagnosics (controls > prosopagnosics: P = 0.01, corrected). There was no significant positive or negative correlation in prosopagnosics (P = 0.9; Pearson: r = -0.4, P = 0.15, two-tailed, n = 17) (Fig. 4 D and F). As expected, no significant correlation between STS activity and face-benefit in speaker recognition was observed (controls > prosopagnosics: P = 0.9, corrected; controls: P = 0.5, corrected).
Discussion
The results are in line with our prediction that the brain exploits previously acquired speaker-specific audiovisual information to improve both auditory-only speech and speaker recognition.

Importantly, we can discount an alternative explanation for the face-benefits in speech and speaker recognition: Under the auditory-only model, one could argue that subjects, during the training phase, paid more attention to voices presented during voice-face learning because of the matching visual video. In contrast, voice-occupation learning is based on static stimuli. This difference in stimuli could result in an advantage for auditory learning during voice-face association and, potentially, explain a face-benefit. However, under this argument, one would necessarily expect a correlation between the face-benefits for speaker and speech recognition. There was no such correlation. In addition, the prosopagnosics are unimpaired in auditory learning; we showed that they do as well as normal subjects after voice-occupation learning. Therefore, the auditory-only model predicts that the prosopagnosics show face-benefits in both tasks, which was not observed. Rather, there was a difference in the face-benefit pattern between the controls and prosopagnosics, which confirms a neuropsychological dissociation in terms of the face-benefits of speech and speaker recognition. We can, therefore, rule out a general attention effect under the auditory-only model as an explanation for our results.
We conclude that subjects must have learned key audiovisual speaker-specific attributes during the training phase. This learning was fast; 2 min of observing each speaker improved subsequent speech recognition, compared with learning based on arbitrary audiovisual cues. A translation of this principle into everyday life is improved telephone communication when the speakers have previously engaged in a brief audiovisual exchange, for example during a meeting. The same argument applies to speaker recognition. Control subjects identified a speaker by voice better if they had seen the speaker talking before. This latter finding confirms two previous studies that showed better speaker recognition after voice-face learning (6, 8).
The audiovisual model (Fig. 1) and visual face-processing models (26) assume two separable neural systems for the processing of face motion (STS) and face identity (FFA). A neuroimaging study showed that, during speaker recognition, FFA activity is increased after voice-face learning (8). Our present findings extend this result in three ways: (i) We show that face-movement-sensitive STS activity is increased after voice-face learning, but only during speech recognition; (ii) activity of the left face-sensitive STS positively correlates with the face-benefit in speech recognition, and FFA activity positively correlates with the face-benefit for speaker recognition; and (iii) FFA activity correlates positively with the face-benefit in controls but not in prosopagnosics. These results confirm our hypothesis of a neuroanatomical dissociation, in terms of selective task- and stimulus-bound response profiles in STS and FFA. We suggest that individual dynamic facial "signatures" (29) are stored in the STS and are involved in predicting the incoming speech content. Note that these dynamic facial signatures might also carry identity information and could therefore potentially be used to improve identity recognition in humans and in primates (30-32). However, our results suggest that neither the controls nor the prosopagnosic subjects employed this information in our experiment to improve their speaker recognition abilities.
Speech recognition during telephone conversations can be improved by video simulations of an artificial "talking face," which helps especially hearing-impaired listeners to understand what is said (33). This creation of an artificial talking face uses a phoneme recognizer and a face synthesizer to recreate the facial movements based on the auditory input. We suggest that our results reflect that the human brain routinely uses a similar mechanism: Auditory-only speech processing is improved by simulation of a talking face. How can such a model be explained in theoretical terms? In visual and sensory-motor processing, "internal forward" models have been used to explain how the brain encodes complex sensory data with relatively few parameters (34, 35). Here, we assume the existence of audiovisual forward models, which encode the physical causal relationship between a person talking and its consequences for the visual and auditory input. Critically, these models also encode the causal dependencies between the visual and auditory trajectories. Perception is based on the "inversion" of such models; i.e., the brain identifies the causes (Mr. Smith says, "Hello") that best explain the observed audiovisual input. Given that robust communication is of utmost importance for us, we posit that the human brain can quickly and efficiently learn "a new person" by adjusting key parameters in existing internal audiovisual forward models that are already geared toward perception of talking faces. Once the parameters for an individual person are learned, auditory speech processing is improved because the brain has learned the parameters of an audiovisual forward model with strong dependencies between its internal auditory and visual trajectories. This enables the system to simulate visual trajectories (via the auditory trajectories) when there is no visual input. The stronger and more veridical the learned coupling between auditory and visual input, the better the talking-face simulation. The visual simulation is fed back to auditory areas, thereby improving auditory recognition by providing additional constraints. This mechanism can be applied iteratively until the inversion of the audiovisual forward model converges on a percept. The scheme of employing forward models to encode and exploit dependencies in the environment by simulation is in accordance with general theories of brain function, which posit that neural mechanisms are tuned for efficient prediction of relevant stimuli (9, 10, 12, 16, 36, 37).
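As an informal illustration of this idea (not a model fitted in the study), consider a linear-Gaussian forward model in which a shared articulatory state generates both an auditory and a visual trajectory. Conditioning on the auditory stream alone lets the model "simulate" the missing visual stream and sharpen its estimate of the underlying state; all matrices and noise levels below are made-up toy values.

# Toy sketch of an audiovisual forward model (illustrative values only):
# a latent articulatory state x_t drives auditory and visual observations;
# with auditory input alone we infer x_t and thereby simulate the unseen
# visual trajectory.
import numpy as np

rng = np.random.default_rng(1)

T = 50                       # number of time steps
A = 0.95                     # latent articulatory dynamics x_t = A x_{t-1} + noise
H_aud, H_vis = 1.0, 0.8      # how the state maps to auditory and visual channels
q, r_aud = 0.05, 0.2         # process and auditory observation noise (variances)

# Generate a ground-truth articulatory trajectory and its auditory trace.
x = np.zeros(T)
for t in range(1, T):
    x[t] = A * x[t - 1] + rng.normal(scale=np.sqrt(q))
auditory = H_aud * x + rng.normal(scale=np.sqrt(r_aud), size=T)

# "Inversion" of the forward model given auditory input only:
# a standard (scalar) Kalman filter over the latent state.
x_hat, P = 0.0, 1.0
simulated_visual = np.zeros(T)
for t in range(T):
    # Predict the next latent state from the learned dynamics.
    x_hat, P = A * x_hat, A * P * A + q
    # Update with the auditory observation.
    K = P * H_aud / (H_aud * P * H_aud + r_aud)
    x_hat = x_hat + K * (auditory[t] - H_aud * x_hat)
    P = (1 - K * H_aud) * P
    # Simulate the visual trajectory that the model expects to see.
    simulated_visual[t] = H_vis * x_hat

true_visual = H_vis * x
corr = np.corrcoef(simulated_visual, true_visual)[0, 1]
print(f"correlation of simulated vs. true visual trajectory: {corr:.2f}")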
We suggest that the simulation of facial features is reflected in our results by the recruitment of visual face areas in response to auditory stimulation. Our findings imply that there are distinct audiovisual models for time-varying and time-constant audiovisual dependencies. We posit that the simulation of a face in response to auditory speech is a general mechanism in human communication. We predict that the same principle also applies to other information that is correlated in the auditory and visual domains, such as recognition of emotion from voice and face (38, 39). Furthermore, this scheme might be a general principle of how unisensory tasks are performed when one or more of the usual input modalities are missing (8, 40).
In summary, we have shown that the brain uses previously encoded visual face information to improve subsequent auditory-only speech and speaker recognition. The improvement in speech and speaker recognition is behaviorally and neuroanatomically dissociable. Speech recognition is based on selective recruitment of the left face-sensitive STS, which is known to be involved in orofacial movement processing (21, 22). Speaker recognition is based on selective recruitment of the FFA, which is involved in face-identity processing (23-25). These findings challenge auditory-only models of speech processing and lead us to conclude that human communication involves at least two distinct audiovisual networks for auditory speech and speaker recognition. The existence of an optimized and robust scheme for human speech processing is a key requirement for efficient communication and successful social interactions.
Methods
Participants. In total, 17 healthy volunteers (10 females, 14 right-handed, 22-52 years of age, mean age 37.4 years, median 38 years) and 17 prosopagnosics (11 females, 17 right-handed, 24-57 years of age, mean age 37.2 years, median 34 years) were included in the study (SI Methods, Participants).
Prosopagnosia Diagnosis. The diagnosis of hereditary prosopagnosia was
based on a standardized semistructured interview (Tables S3 and S4) (41, 42),
which has been validated with objective face recognition tests in previous
studies (41, 43).
Stimuli. For a detailed description of the stimuli, see SI Methods, Stimuli.
Experimental Design.
Training phase.
All participants were trained outside the
MRI-scanner. In each trial the name of the speaker was first presented (for 1 s)
followed by presentation of a sentence spoken by that speaker (1.3 s). For
three of the speakers, the sentences were presented together with the video
of the speaking face (voice–face learning). Three other speakers’ voices were
presented together with static symbols for three different occupations
(painter, craftsman, and cook) (voice–occupation learning). The two sets of
speakers were counterbalanced over participants: In each group, nine partic-
ipants learned the first set of speakers with the faces and the second set with
the symbols, whereas the other eight participants learned the reverse assignment.
Total exposure to audiovisual information about a speaker was 2 min
(SI Methods, Experimental Design).
Test phase.
The test phase consisted of three 15-min MRI-scanning sessions
and included four speech and one nonspeech condition (see Introduction).
Before the first session, participants were briefly familiarized, inside the
scanner, with the setting by showing them a single trial of each task. Stimuli
(auditory sentences or vehicle sounds) were presented in a block design.
Blocks were presented fully randomized. There were 12 blocks per condi-
tion in total. Each block lasted 29 s and contained eight trials. One trial
lasted 3.6 s and consisted of two consecutive sentences spoken by the
same person or two vehicle sounds. In the last second of each trial, a written
word (speech task), person name (speaker task), or vehicle name (vehicle
task) was presented. Subjects indicated via button press whether the shown
word was present in the spoken sentence (speech task) and whether the
shown person name matched the speaker’s voice (speaker task) or not.
Similarly, in the vehicle task, subjects indicated whether the vehicle name
matched the vehicle sound or not. Between blocks, subjects looked at a
fixation cross lasting 12 s.
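For concreteness, the block timing described above (29-s blocks of eight 3.6-s trials, 12-s fixation between blocks, 12 blocks per condition, fully randomized) can be sketched as a simple schedule generator. This is an illustrative reconstruction, not the authors' stimulus-delivery code, and the condition labels are placeholders.

# Illustrative sketch of the randomized block schedule described above.
import random

CONDITIONS = [
    "speech/voice-face", "speech/voice-occupation",
    "speaker/voice-face", "speaker/voice-occupation",
    "vehicle",
]
BLOCKS_PER_CONDITION = 12
TRIALS_PER_BLOCK = 8
TRIAL_DURATION = 3.6      # seconds (two sentences or two vehicle sounds + probe)
BLOCK_DURATION = TRIALS_PER_BLOCK * TRIAL_DURATION   # approximately 29 s
FIXATION = 12.0           # seconds of fixation cross between blocks

def make_schedule(seed=0):
    """Return a fully randomized list of (onset_s, condition) block entries."""
    rng = random.Random(seed)
    blocks = CONDITIONS * BLOCKS_PER_CONDITION
    rng.shuffle(blocks)
    schedule, onset = [], 0.0
    for condition in blocks:
        schedule.append((round(onset, 1), condition))
        onset += BLOCK_DURATION + FIXATION
    return schedule

if __name__ == "__main__":
    for onset, condition in make_schedule()[:5]:
        print(f"{onset:7.1f} s  {condition}")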
Face area localizer.
The visual localizer study consisted of two 6-min MRI-scanning
sessions and included four conditions of passive viewing of face or object pictures:
(i) faces from different persons with different facial gestures (speaking), (ii)
different facial gestures of the same person’s face, (iii) different objects in
different views, and (iv) same object in different views. Conditions were pre-
sented as blocks of 25-s duration. Within the blocks, single stimuli were presented
for 500 ms without pause between stimuli. This fast stimulus presentation in-
duced a movement illusion in the condition where the same person’s face was
presented (moving face), but not in those with faces from different persons (static
faces). A fixation cross was introduced between the blocks for 18 s.
Data Acquisition and Analysis. MRI was performed on a 3-T Siemens Vision
scanner (SI Methods, Data acquisition), and the data were analyzed with SPM5
(www.fil.ion.ucl.ac.uk/spm), using standard procedures (SI Methods, Analysis
of MRI data).
Behavioral data were analyzed using SPSS 12.02 (SPSS). All P values
reported are two-tailed.
Localization of face-sensitive areas. We defined the regions of interest (ROI) by using the face area localizer. The STS-ROI was defined by the contrast moving face vs. static faces. In the group analysis, this contrast was used to inclusively mask the contrast faces vs. objects (maximum for both groups in the left STS: x = -52, y = -56, z = 6, cluster size = 19 voxels). The localizer contrast also included a region in the right STS (x = 54, y = -40, z = 6, cluster size = 737 voxels). We report analyses within this region in SI Methods, Table S5, and Fig. S4. The FFA-ROI was defined by the contrast faces vs. objects (the maximum for both groups was in the right FFA: x = 44, y = -44, z = -24, cluster size = 20 voxels) (SI Methods, Face area localizer). There was no homologous significant activity in the left hemisphere. The statistical maxima for individual subjects are displayed in Tables S1 and S2.
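The ROI logic (threshold one contrast, inclusively mask it with another, take the peak voxel) can be sketched as plain array operations; the arrays, blob location, and threshold below are hypothetical stand-ins for the SPM contrast maps used in the study.

# Minimal sketch of ROI definition by inclusive masking (hypothetical maps).
import numpy as np

rng = np.random.default_rng(2)
shape = (53, 63, 46)                       # placeholder volume dimensions

# Stand-ins for voxelwise t-maps of the two localizer contrasts, sharing a
# hypothetical face-sensitive blob plus independent noise.
signal = np.zeros(shape)
signal[20:24, 30:34, 22:26] = 5.0
t_moving_vs_static = signal + rng.normal(size=shape)
t_faces_vs_objects = signal + rng.normal(size=shape)

T_THRESHOLD = 3.1                          # illustrative height threshold

# STS-ROI: moving vs. static faces, inclusively masked by faces vs. objects.
sts_mask = (t_moving_vs_static > T_THRESHOLD) & (t_faces_vs_objects > T_THRESHOLD)

# The peak voxel of the masked contrast is where parameter estimates would
# be extracted for the categorical and correlation analyses.
masked = np.where(sts_mask, t_moving_vs_static, -np.inf)
peak_index = np.unravel_index(np.argmax(masked), shape)

print("voxels in STS-ROI:", int(sts_mask.sum()))
print("peak voxel index (array coordinates, not MNI mm):", peak_index)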
Categorical analysis for test phase.
In the categorical analysis, the contrasts of interest were the interactions (i) (speech task/voice-face learning - speech task/voice-occupation learning) > (speaker task/voice-face learning - speaker task/voice-occupation learning) and (ii) (speaker task/voice-face learning - speaker task/voice-occupation learning) > (speech task/voice-face learning - speech task/voice-occupation learning). These contrasts were computed at the single-subject level. For each subject's FFA and STS (as determined by the face area localizer), parameter estimates were extracted from the voxel at which we found the maximum statistic (SI Methods, Categorical analysis, and Tables S1 and S2). These values were then entered into a repeated-measures ANOVA and plotted (Fig. 4 A and B) using SPSS 12.02.
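The interaction contrasts above amount to a weight vector of [+1, -1, -1, +1] (and its negation) over the four conditions; a minimal sketch with hypothetical per-condition parameter estimates:

# Sketch of the interaction contrast over the four conditions (toy values).
import numpy as np

# Order: [speech/voice-face, speech/voice-occupation,
#         speaker/voice-face, speaker/voice-occupation]
betas = np.array([1.20, 0.70, 0.40, 0.45])   # hypothetical parameter estimates

# (i): (speech face - speech occupation) > (speaker face - speaker occupation)
contrast_speech_interaction = np.array([+1, -1, -1, +1])
# (ii): the reverse interaction
contrast_speaker_interaction = -contrast_speech_interaction

print("contrast (i):", float(betas @ contrast_speech_interaction))
print("contrast (ii):", float(betas @ contrast_speaker_interaction))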
Correlation analysis for test phase.
In the correlation analysis, the fMRI signal in
FFA and STS after voice–face learning was correlated with the behavioral
face-benefit, i.e., recognition rate (%) after voice–face learning minus recog-
nition rate (%) after voice-occupation learning, as determined separately for
speech and speaker task. This group analysis was performed by using the
MarsBaR ROI toolbox (http://marsbar.sourceforge.net) (SI Methods, Correla-
tion analysis). To estimate Pearson’s r values, parameter estimates were
extracted at the group maximum and entered into SPSS 12.02.
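The group-level correlation step (relating each subject's ROI parameter estimate after voice-face learning to that subject's behavioral face-benefit) reduces to a Pearson correlation; the sketch below uses hypothetical per-subject values rather than the MarsBaR output.

# Sketch of the ROI-behavior correlation analysis (hypothetical values).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n_subjects = 17

# Hypothetical per-subject values for one group:
face_benefit = rng.normal(loc=5.0, scale=8.0, size=n_subjects)           # % points
roi_beta = 0.05 * face_benefit + rng.normal(scale=0.3, size=n_subjects)  # a.u.

r, p = pearsonr(roi_beta, face_benefit)
print(f"Pearson r = {r:.2f}, two-tailed p = {p:.3f} (n = {n_subjects})")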
ACKNOWLEDGMENTS. We thank Chris Frith, Karl Friston, Peter Dayan, and
Tim Griffiths for comments on the manuscript and Stefanie Dahlhaus for
providing the voice–face videos. This study was supported by the Volkswa-
genStiftung and Wellcome Trust.
1. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215.
2. van Wassenhove V, Grant KW, Poeppel D (2005) Visual speech speeds up the neural processing of auditory speech. Proc Natl Acad Sci USA 102:1181–1186.
3. Schweinberger SR, Robertson D, Kaufmann JM (2007) Hearing facial identities. Q J Exp Psychol (Colchester) 60:1446–1456.
4. Belin P, Fecteau S, Bedard C (2004) Thinking the voice: Neural correlates of voice perception. Trends Cognit Sci 8:129–135.
5. Braida LD (1991) Crossmodal integration in the identification of consonant segments. Q J Exp Psychol A 43:647–677.
6. Sheffert SM, Olson E (2004) Audiovisual speech facilitates voice learning. Percept Psychophys 66:352–362.
7. von Kriegstein K, Kleinschmidt A, Giraud AL (2006) Voice recognition and cross-modal responses to familiar speakers' voices in prosopagnosia. Cereb Cortex 16:1314–1322.
8. von Kriegstein K, Giraud AL (2006) Implicit multisensory associations influence voice recognition. PLoS Biol 4:e326.
9. Deneve S, Duhamel JR, Pouget A (2007) Optimal sensorimotor integration in recurrent cortical networks: A neural implementation of Kalman filters. J Neurosci 27:5744–5756.
10. Friston K (2005) A theory of cortical responses. Philos Trans R Soc London Ser B 360:815–836.
11. Halle M (2002) From Memory to Speech and Back: Papers on Phonetics and Phonology, 1954–2002 (de Gruyter, Berlin).
12. Knill D, Kersten D, Yuille A (1998) in Perception as Bayesian Inference, eds Knill D, Richards W (Cambridge Univ Press, Cambridge, UK), pp 1–21.
13. Ellis HD, Jones DM, Mosdell N (1997) Intra- and inter-modal repetition priming of familiar faces and voices. Br J Psychol 88:143–156.
14. Hickok G, Poeppel D (2007) The cortical organization of speech processing. Nat Rev Neurosci 8:393–402.
15. Dayan P (2006) Images, frames, and connectionist hierarchies. Neural Comput 18:2293–2319.
16. Rao RP, Ballard DH (1999) Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat Neurosci 2:79–87.
17. Fant G (1960) Acoustic Theory of Speech Production (Mouton, Paris).
18. Yehia H, Rubin P, Vatikiotis-Bateson E (1998) Quantitative association of vocal-tract and facial behavior. Speech Commun 26:23–43.
19. Lavner Y, Gath I, Rosenhouse J (2000) The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Commun 30:9–26.
20. Humphreys K, Avidan G, Behrmann M (2007) A detailed investigation of facial expression processing in congenital prosopagnosia as compared to acquired prosopagnosia. Exp Brain Res 176:356–373.
21. Calvert GA, et al. (1997) Activation of auditory cortex during silent lipreading. Science 276:593–596.
22. Puce A, Allison T, Bentin S, Gore JC, McCarthy G (1998) Temporal cortex activation in humans viewing eye and mouth movements. J Neurosci 18:2188–2199.
23. Eger E, Schyns PG, Kleinschmidt A (2004) Scale invariant adaptation in fusiform face-responsive regions. NeuroImage 22:232–242.
24. Loffler G, Yourganov G, Wilkinson F, Wilson HR (2005) fMRI evidence for the neural representation of faces. Nat Neurosci 8:1386–1390.
25. Rotshtein P, Henson RN, Treves A, Driver J, Dolan RJ (2005) Morphing Marilyn into Maggie dissociates physical and identity face representations in the brain. Nat Neurosci 8:107–113.
26. Haxby JV, Hoffman EA, Gobbini MI (2000) The distributed human neural system for face perception. Trends Cognit Sci 4:223–233.
27. Kanwisher N, Yovel G (2006) The fusiform face area: A cortical region specialized for the perception of faces. Philos Trans R Soc London Ser B 361:2109–2128.
28. Dupont S, Luettin J (2000) Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2:141–151.
29. O'Toole AJ, Roark DA, Abdi H (2002) Recognizing moving faces: A psychological and neural synthesis. Trends Cognit Sci 6:261–266.
30. Ghazanfar AA, Maier JX, Hoffman KL, Logothetis NK (2005) Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J Neurosci 25:5004–5012.
31. Lander K, Davies R (2007) Exploring the role of characteristic motion when learning new faces. Q J Exp Psychol (Colchester) 60:519–526.
32. Kamachi M, Hill H, Lander K, Vatikiotis-Bateson E (2003) Putting the face to the voice: Matching identity across modality. Curr Biol 13:1709–1714.
33. Siciliano C, Williams G, Beskow J, Faulkner A (2002) Evaluation of a multilingual synthetic talking face as a communication aid for the hearing-impaired. Speech Hear Lang Work Prog 14:51–61.
34. Ballard DH, Hinton GE, Sejnowski TJ (1983) Parallel visual computation. Nature 306:21–26.
35. Kawato M, Hayakawa H, Inui T (1993) A forward-inverse optics model of reciprocal connections between visual cortical areas. Network-Comput Neural Syst 4:415–422.
36. Bar M (2007) The proactive brain: Using analogies and associations to generate predictions. Trends Cognit Sci 11:280–289.
37. Wolpert DM, Ghahramani Z, Jordan MI (1995) An internal model for sensorimotor integration. Science 269:1880–1882.
38. de Gelder B, Pourtois G, Weiskrantz L (2002) Fear recognition in the voice is modulated by unconsciously recognized facial expressions but not by unconsciously recognized affective pictures. Proc Natl Acad Sci USA 99:4121–4126.
39. de Gelder B, Morris JS, Dolan RJ (2005) Unconscious fear influences emotional awareness of faces and voices. Proc Natl Acad Sci USA 102:18682–18687.
40. Amedi A, et al. (2007) Shape conveyed by visual-to-auditory sensory substitution activates the lateral occipital complex. Nat Neurosci 10:687–689.
41. Gruter M, et al. (2007) Hereditary prosopagnosia: The first case series. Cortex 43:734–749.
42. Kennerknecht I, et al. (2006) First report of prevalence of non-syndromic hereditary prosopagnosia (HPA). Am J Med Genet A 140:1617–1622.
43. Carbon CC, Grueter T, Weber JE, Lueschow A (2007) Faces as objects of non-expertise: Processing of thatcherised faces in congenital prosopagnosia. Perception 36:1635–1645.
... There are a small number of studies showing that voice learning can also be facilitated in this manner, e.g., [5,18]; see also [19]. von Kriegstein and colleagues [18] examined this question by performing behavioral and neuroimaging experiments with prosopagnosics and matched controls. ...
... There are a small number of studies showing that voice learning can also be facilitated in this manner, e.g., [5,18]; see also [19]. von Kriegstein and colleagues [18] examined this question by performing behavioral and neuroimaging experiments with prosopagnosics and matched controls. Participants either both heard and saw three talkers (voice-face condition) uttering sentences or just heard sentences while being presented with static symbols of different occupations (voice-occupation condition). ...
... This could mean that FFA activity during audio-only talker recognition (for talkers who were part of the voice-face learning condition) might have optimized voice recognition. It is important to note that the study of von Kriegstein et al. [18] used (a) nonface test stimuli (vehicle sounds) that are different from the more common audio-alone control stimuli more typically used in the literature, and(b) the training phase only included a small set of talkers (three). ...
Article
Full-text available
It is known that talkers can be recognized by listening to their specific vocal qualities—breathiness and fundamental frequencies. However, talker identification can also occur by focusing on the talkers’ unique articulatory style, which is known to be available auditorily and visually and can be shared across modalities. Evidence shows that voices heard while seeing talkers’ faces are later recognized better on their own compared to the voices heard alone. The present study investigated whether the facilitation of voice learning through facial cues relies on talker-specific articulatory or nonarticulatory facial information. Participants were initially trained to learn the voices of ten talkers presented either on their own or together with (a) an articulating face, (b) a static face, or (c) an isolated articulating mouth. Participants were then tested on recognizing the voices on their own regardless of their training modality. Consistent with previous research, voices learned with articulating faces were recognized better on their own compared to voices learned alone. However, isolated articulating mouths did not provide an advantage in learning the voices. The results demonstrated that learning voices while seeing faces resulted in better voice learning compared to the voices learned alone.
... In auditory-only experiment, it is reported that faceprocessing areas in the brain are instrumental for improving speech and speaker recognition performances even without visual facial input [18]. Also, it is known that there is an own-gender bias in face recognition, that is, females remember more female faces than males do, but not more male faces [19]. ...
Article
Full-text available
Human speaker recognition performance can be degraded by various factors. Understanding the factors affecting it and the errors caused by these factors is crucial for forensic applications. To study the effects of noisy environments on human speaker recognition, we conducted a hearing experiment using speech samples of two words by five male speakers, and two noise types (speech-like noise and environmental noise in boiler room) with three steps of signal-to-noise ratio (∞, 0 dB, or −10 dB). The results suggested that the listeners tended to observe different speakers to be the same speaker rather than vice versa, and this tendeny was also affected by sex of the listener.
... Such studies would be an important contribution to the memory literature, which reports largely conflicting observations. For example, different studies have found the simultaneous presentation of face and voice stimuli help (e.g., Maguinness et al., 2021;von Kriegstein et al., 2008;Zäske et al., 2015) and hinder (e.g., Lavan et al., 2023;Cook & Wilding, 2001;Tomlin et al., 2017) voice identity learning and recognition. ...
Article
Full-text available
Emotional stimuli and events are better and more easily remembered than neutral ones. However, this advantage appears to come at a cost, namely a decreased accuracy for peripheral, emotion-irrelevant details. There is some evidence, particularly in the visual modality, that this trade-off also applies to emotional expressions, leading to a difficulty in identifying an unfamiliar individual’s identity when presented with an expression different from the one encountered at encoding. On the other hand, past research also suggests that identity recognition memory benefits from exposure to different encoding exemplars, although whether this is also the case for emotional expressions, particularly voices, remains unknown. Here, we directly addressed these questions by conducting a series of voice and face identity memory online studies, using a within-subject old/new recognition test in separate unimodal modules. In the Main Study, half of the identities were encoded with four presentations of one single expression (angry, fearful, happy, or sad; Uni condition) and the other half with one presentation of each emotion (Multi condition); all identities, intermixed with an equal number of new ones, were presented with a neutral expression in a subsequent recognition test. Participants (N = 547, 481 female) were randomly assigned to one of four groups in which a different Uni single emotion was used. Results, using linear mixed models on response choice and drift-diffusion-model parameters, revealed that high-arousal expressions interfered with emotion-independent identity recognition accuracy, but that such deficit could be compensated by presenting the same individual with various expressions (i.e., high exemplar variability). These findings were confirmed by a significant correlation between memory performance and stimulus arousal, across modalities and emotions, and by two follow-up studies (Study 1: N = 172, 150 female; Study 2: N = 174, 154 female), which extended the original observations and ruled out some potential confounding effects. Taken together, the findings reported here expand and refine our current knowledge of the influence of emotion on memory, and highlight the importance of, and interaction between, exemplar variability and emotional arousal in identity recognition memory.
... Visual regions like the putative 'visual word form area' (110) and face and motion processing regions (111,112) play roles in audio-only speech perception, even when there is no visual information available. ...
Preprint
Full-text available
Models of the neurobiology of language suggest that a small number of anatomically fixed brain regions are responsible for language functioning. This observation derives from centuries of examining brain lesions causing aphasia and is supported by decades of neuroimaging studies. The latter rely on thresholded measures of central tendency applied to activity patterns resulting from heterogeneous stimuli. We hypothesised that these methods obscure the whole brain distribution of regions supporting language. Specifically, cortical 'language regions' and the corresponding 'language network' consist of input regions and connectivity hubs. The latter primarily coordinate peripheral regions whose activity is variable, making them likely to be averaged out following thresholding. We tested these hypotheses in two studies using neuroimaging meta-analyses and functional magnetic resonance imaging during film watching. Both converged to suggest that averaging over heterogeneous words is localised to regions historically associated with language but distributed throughout most of the brain when not averaging over the sensorimotor properties of those words. The localised word regions are composed of highly central hubs. The film data further shows that these hubs are dynamic, connected to peripheral regions, and only appear in the aggregate across time. Results suggest that 'language regions' are an artefact of indiscriminately averaging across heterogeneous language representations and linguistic processes. Rather, they are mostly dynamic connectivity hubs coordinating whole-brain distributions of networks for processing the complexities of real-world language use, explaining why damage to them results in aphasia.
... In addition to the representation of the sound of a familiar voice, specific memories associated with the person will be accessible to a listener once a voice is recognised. The sound of a familiar voice may also activate representations of the person's face 30,31 , alongside biographical knowledge about them (even for people we 'know' but have never met, such as celebrities). Similarly, we can access emotionally and socially salient information about, for example, whether we like this person or not, as well as specific memories of events and situations involving them 5,6 . ...
Article
Full-text available
When hearing a voice, listeners can form a detailed impression of the person behind the voice. Existing models of voice processing focus primarily on one aspect of person perception - identity recognition from familiar voices - but do not account for the perception of other person characteristics (e.g., sex, age, personality traits). Here, we present a broader perspective, proposing that listeners have a common perceptual goal of perceiving who they are hearing, whether the voice is familiar or unfamiliar. We outline and discuss a model - the Person Perception from Voices (PPV) model - that achieves this goal via a common mechanism of recognising a familiar person, persona, or set of speaker characteristics. Our PPV model aims to provide a more comprehensive account of how listeners perceive the person they are listening to, using an approach that incorporates and builds on aspects of the hierarchical frameworks and prototype-based mechanisms proposed within existing models of voice identity recognition.
... Moreover, an apparently universal and unexplained feature of multisensory learning is that it improves subsequent memory performance even for the separate unisensory components 3,5 . Studies in humans and other mammals have suggested that multisensory learning benefits from interactions between modality-specific cortices that were co-active during training, and that individual senses can reactivate both areas at testing 3,[6][7][8][9] . In addition, cells in different brain regions respond to multiple sensory cues and the proportions or numbers change after multisensory learning 1,10-12 . ...
Article
Full-text available
Associating multiple sensory cues with objects and experience is a fundamental brain process that improves object recognition and memory performance. However, neural mechanisms that bind sensory features during learning and augment memory expression are unknown. Here we demonstrate multisensory appetitive and aversive memory in Drosophila. Combining colours and odours improved memory performance, even when each sensory modality was tested alone. Temporal control of neuronal function revealed visually selective mushroom body Kenyon cells (KCs) to be required for enhancement of both visual and olfactory memory after multisensory training. Voltage imaging in head-fixed flies showed that multisensory learning binds activity between streams of modality-specific KCs so that unimodal sensory input generates a multimodal neuronal response. Binding occurs between regions of the olfactory and visual KC axons, which receive valence-relevant dopaminergic reinforcement, and is propagated downstream. Dopamine locally releases GABAergic inhibition to permit specific microcircuits within KC-spanning serotonergic neurons to function as an excitatory bridge between the previously ‘modality-selective’ KC streams. Cross-modal binding thereby expands the KCs representing the memory engram for each modality into those representing the other. This broadening of the engram improves memory performance after multisensory learning and permits a single sensory feature to retrieve the memory of the multimodal experience.
... This is evidenced by brain-imaging data showing functional and structural connections between a voice and face-sensitive areas in the brain (reviewed in [5]). Specifically, it has been demonstrated that voices alone [2] can activate the fusiform face area (FFA) for personally familiar speakers, as well as for learned, beforehand-unfamiliar, speakers [6]. Moreover, the existence of fiber tracts connecting the FFA with voice-sensitive areas in the superior temporal sulcus (STS) has also been demonstrated [7]. ...
Article
Full-text available
Recognizing people from their voices may be facilitated by a voice’s distinctiveness, in a manner similar to that which has been reported for faces. However, little is known about the neural time-course of voice learning and the role of facial information in voice learning. Based on evidence for audiovisual integration in the recognition of familiar people, we studied the behavioral and electrophysiological correlates of voice learning associated with distinctive or non-distinctive faces. We repeated twelve unfamiliar voices uttering short sentences, together with either distinctive or non-distinctive faces (depicted before and during voice presentation) in six learning-test cycles. During learning, distinctive faces increased early visually-evoked (N170, P200, N250) potentials relative to non-distinctive faces, and face distinctiveness modulated voice-elicited slow EEG activity at the occipito–temporal and fronto-central electrodes. At the test, unimodally-presented voices previously learned with distinctive faces were classified more quickly than were voices learned with non-distinctive faces, and also more quickly than novel voices. Moreover, voices previously learned with faces elicited an N250-like component that was similar in topography to that typically observed for facial stimuli. The preliminary source localization of this voice-induced N250 was compatible with a source in the fusiform gyrus. Taken together, our findings provide support for a theory of early interaction between voice and face processing areas during both learning and voice recognition.
Article
Full-text available
Face recognition is important for both visual and social cognition. While prosopagnosia or face blindness has been known for seven decades and face specific neurons for half a century, the molecular genetic mechanism is not clear. Here we report results after 17 years of research with classic genetics and modern genomics. From a large family with 18 congenital prosopagnosia (CP) members with obvious difficulties in face recognition in daily life, we uncovered a fully cosegregating private mutation in the MCTP2 gene which encodes a calcium binding transmembrane protein expressed in the brain. After screening through cohorts of 6589, we found more CPs and their families, allowing detection of more CP associated mutations in MCTP2. Face recognition differences were detected between 14 carriers with the frameshift mutation S80fs in MCTP2 and 19 non-carrying volunteers. 6 families including one with 10 members showed the S80fs-CP correlation. Functional magnetic resonance imaging found association of impaired recognition of individual faces by MCTP2 mutant CPs with reduced repetition suppression to repeated facial identities in the right fusiform face area. Our results have revealed genetic predisposition of MCTP2 mutations in CP, 76 years after the initial report of prosopagnosia and 47 years after the report of the first CP. This is the first time a gene required for a higher form of visual social cognition was found in humans.
Article
Person-knowledge encompasses the diverse types of knowledge we have about other people. This knowledge spans the social, physical, episodic, semantic & nominal information we possess about others and is served by a distributed cortical network including core (perceptual) and extended (non-perceptual) subsystems. Our understanding of this cortical system is tightly linked to the perception of faces and the extent to which cortical knowledge-access processes are independent of perception is unclear. In this study, participants were presented with the written names of famous people and performed ten different semantic access tasks drawn from five cognitive domains (biographic, episodic, nominal, social and physical). We used representational similarity analysis, adapted to investigate network-level representations (NetRSA) to characterise the inter-regional functional coordination within the non-perceptual extended subsystem across access to varied forms of person-knowledge. Results indicate a hierarchical cognitive taxonomy consistent with that seen during face-processing and forming the same three macro-domains: socio-perceptual judgements, episodic-semantic memory and nominal knowledge. The coordination across regions was largely preserved within elements of the extended system associated with internalised cognition but differed in prefrontal regions. Results suggest the elements of the extended system work together in a consistent way to access knowledge when viewing faces and names but that coordination patterns also change as a function of input-processing demands.
Article
Full-text available
Humans recognize one another by identifying their voices and faces. For sighted people, the integration of voice and face signals in corresponding brain networks plays an important role in facilitating the process. However, individuals with vision loss primarily resort to voice cues to recognize a person’s identity. It remains unclear how the neural systems for voice recognition reorganize in the blind. In the present study, we collected behavioral and resting-state fMRI data from 20 early blind (5 females; mean age = 22.6 years) and 22 sighted control (7 females; mean age = 23.7 years) individuals. We aimed to investigate the alterations in the resting-state functional connectivity (FC) among the voice- and face-sensitive areas in blind subjects in comparison with controls. We found that the intranetwork connections among voice-sensitive areas, including amygdala-posterior “temporal voice areas” (TVAp), amygdala-anterior “temporal voice areas” (TVAa), and amygdala-inferior frontal gyrus (IFG) were enhanced in the early blind. The blind group also showed increased FCs of “fusiform face area” (FFA)-IFG and “occipital face area” (OFA)-IFG but decreased FCs between the face-sensitive areas (i.e., FFA and OFA) and TVAa. Moreover, the voice-recognition accuracy was positively related to the strength of TVAp-FFA in the sighted, and the strength of amygdala-FFA in the blind. These findings indicate that visual deprivation shapes functional connectivity by increasing the intranetwork connections among voice-sensitive areas while decreasing the internetwork connections between the voice- and face-sensitive areas. Moreover, the face-sensitive areas are still involved in the voice-recognition process in blind individuals through pathways such as the subcortical-occipital or occipitofrontal connections, which may benefit the visually impaired greatly during voice processing.
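As a rough illustration of the resting-state functional connectivity measure used in such designs, the sketch below computes the Pearson correlation between two ROI time series, Fisher z-transforms it, and compares groups; the group sizes echo the abstract, but the time series, ROI labels, and choice of test are assumptions of this sketch, not the study's analysis.

# Illustrative sketch: resting-state functional connectivity (FC) between two
# regions as the Fisher z-transformed Pearson correlation of their time series.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(1)

def fc_z(ts_a, ts_b):
    """Fisher z-transformed Pearson correlation between two ROI time series."""
    r, _ = pearsonr(ts_a, ts_b)
    return np.arctanh(r)

# Simulated ROI time series (e.g., TVAp and FFA) for blind and sighted groups:
blind = [fc_z(rng.standard_normal(200), rng.standard_normal(200)) for _ in range(20)]
sighted = [fc_z(rng.standard_normal(200), rng.standard_normal(200)) for _ in range(22)]
t, p = ttest_ind(blind, sighted)
print(f"Group difference in TVAp-FFA connectivity: t={t:.2f}, p={p:.3f}")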
Article
Full-text available
We propose that the feedforward connection from a lower to a higher visual cortical area provides an approximate inverse model of the imaging process (optics), while the backprojection from the higher to the lower area provides a forward model of the optics. By mathematical analysis and computer simulation, we show that a small number of relaxation computations circulating through this forward-inverse optics hierarchy achieves fast and reliable integration of vision modules, and might therefore resolve the following problems. (i) How are parallel visual modules (multiple visual cortical areas) integrated to allow coherent scene perception? (ii) How can ill-posed vision problems be solved by the brain within several hundred milliseconds?
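A toy numerical sketch of this relaxation scheme, under assumptions of my own (a one-parameter "scene" rendered as a 1-D Gaussian image), illustrates how a feedforward approximate inverse plus a few forward-model correction cycles can recover the scene parameter:

# Toy illustration (assumptions mine) of forward-inverse relaxation:
# the forward model renders an image from a scene parameter, the approximate
# inverse maps image errors back to parameter corrections, and a few
# relaxation iterations refine the estimate.
import numpy as np

def forward_optics(x):
    """Toy forward model: render a 1-D 'image' from scene parameter x."""
    positions = np.linspace(-1.0, 1.0, 50)
    return np.exp(-((positions - x) ** 2) / 0.1)

def approximate_inverse(image_error, x_current, eps=1e-3):
    """Crude inverse: project the image error onto the local sensitivity of the forward model."""
    sensitivity = (forward_optics(x_current + eps) - forward_optics(x_current)) / eps
    return image_error @ sensitivity / (sensitivity @ sensitivity)

true_x = 0.37
observed_image = forward_optics(true_x)

x_hat = approximate_inverse(observed_image - forward_optics(0.0), 0.0)  # feedforward first guess
for _ in range(5):                                                       # a few relaxation cycles
    prediction_error = observed_image - forward_optics(x_hat)            # forward model (backprojection)
    x_hat += approximate_inverse(prediction_error, x_hat)                # correction via inverse model
print(f"true x = {true_x:.3f}, estimate = {x_hat:.3f}")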
Article
Watching a speaker’s lips during face-to-face conversation (lipreading) markedly improves speech perception, particularly in noisy conditions. Functional magnetic resonance imaging revealed that these linguistic visual cues are sufficient to activate auditory cortex in normal-hearing individuals in the absence of auditory speech sounds. Two further experiments suggest that these auditory cortical areas are not engaged when an individual views nonlinguistic facial movements, but appear to be activated by silent, meaningless, speech-like movements (pseudospeech). This supports psycholinguistic evidence that seen speech influences the perception of heard speech at a prelexical stage.
Article
Speech perception provides compelling examples of a strong link between the auditory and visual modalities [1, 2]. This link originates in the mechanics of speech production, which, in shaping the vocal tract, determine the movement of the face as well as the sound of the voice [3, 4]. In this paper, we present evidence that equivalent information about identity is available cross-modally from both the face and the voice. Using a delayed matching-to-sample task (XAB), we show that people can match a video of an unfamiliar face, X, to an unfamiliar voice, A or B, and vice versa, but only when the stimuli are moving and are played forward. The critical role of time-varying information is underlined by the ability to match faces to voices containing only the coarse spatial and temporal information provided by sine-wave speech [5]. The effect of varying sentence content across modalities was small, showing that identity-specific information is not closely tied to particular utterances. We conclude that the physical constraints linking faces to voices result in bimodally available dynamic information, not only about what is being said but also about who is saying it.
Article
The author regrets that there was a mistake in reference [37] in the above article. The correct reference is: Oliva, A. and Torralba, A. (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175. The author sincerely apologizes for any problems that this error may have caused.
Article
"Oral speech intelligibility tests were conducted with, and without, supplementary visual observation of the speaker's facial and lip movements. The difference between these two conditions was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test. The visual contribution to oral speech intelligibility (relative to its possible contribution) is, to a first approximation, independent of the speech-to-noise ratio under test. However, since there is a much greater opportunity for the visual contribution at low speech-to-noise ratios, its absolute contribution can be exploited most profitably under these conditions." (PsycINFO Database Record (c) 2012 APA, all rights reserved)