Adaptation to spectrally-rotated speech
Tim Green, Stuart Rosen, Andrew Faulkner, and Ruth Paterson
Speech, Hearing, and Phonetic Sciences, UCL, Chandler House, 2, Wakefield Street, London, WC1N 1PF,
United Kingdom
(Received 20 December 2012; revised 28 May 2013; accepted 10 June 2013)
Much recent interest surrounds listeners’ abilities to adapt to various transformations that distort
speech. An extreme example is spectral rotation, in which the spectrum of low-pass filtered speech
is inverted around a center frequency (2 kHz here). Spectral shape and its dynamics are completely
altered, rendering speech virtually unintelligible initially. However, intonation, rhythm, and con-
trasts in periodicity and aperiodicity are largely unaffected. Four normal hearing adults underwent
6 h of training with spectrally-rotated speech using Continuous Discourse Tracking. They and an
untrained control group completed pre- and post-training speech perception tests, for which talkers
differed from the training talker. Significantly improved recognition of spectrally-rotated sentences
was observed for trained, but not untrained, participants. However, there were no significant
improvements in the identification of medial vowels in /bVd/ syllables or intervocalic consonants.
Additional tests were performed with speech materials manipulated so as to isolate the contribution
of various speech features. These showed that preserving intonational contrasts did not contribute
to the comprehension of spectrally-rotated speech after training, and suggested that improvements
involved adaptation to altered spectral shape and dynamics, rather than just learning to focus on
speech features relatively unaffected by the transformation.
© 2013 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4812759]
PACS number(s): 43.71.Sy [JMH] Pages: 1369–1377
I. INTRODUCTION
Listeners possess considerable abilities to adapt to trans-
formations which, to various extents, degrade and distort im-
portant features of speech signals. Examples include noise-
or tone-excited vocoding (Hill et al., 1968; Shannon et al.,
1995; Dorman et al., 1997a), sine-wave speech (Remez
et al., 1981), time-compressed speech (Dupoux and Green,
1997), and spectral rotation (Blesser, 1972). The speed
and degree of adaptation vary across transformations.
Investigation of factors that contribute to adaptation and its
limitations may provide valuable insights into perceptual
learning processes and inform models of speech perception.
Adaptation to noise- or tone-vocoded speech has
received considerable interest, not least because this type of
processing has features in common with the processing typi-
cally applied in cochlear implant systems (Davis et al.,
2005; Hervais-Adelman et al., 2008; Hervais-Adelman
et al., 2011; Loebach et al., 2010). Spectral resolution is lim-
ited to a small number of broad frequency bands, temporal
fine structure is eliminated, but amplitude envelopes within
each frequency band are preserved. Learning of speech that
has been tone- or noise-vocoded but not subject to other dis-
tortion is very rapid. For example, using six-channel noise-
vocoded speech, Davis et al. (2005) reported that sentence
recognition improved from near zero to 70% words correct
over the course of presentation of just 30 sentences. One im-
portant finding has been that improvement after training is
seen for words that were not heard during training, suggest-
ing that learning involves modification of the processing of
phonetic cues at a sublexical level (Davis et al., 2005).
Similarly, with time-compressed speech, the degree to which
learning transfers across languages has been found to depend
on the phonological similarity between the languages
(Pallier et al., 1998), suggesting that learning occurred at the
phonetic level.
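As a rough illustration of the noise-excited vocoding referred to above (not code from any of the cited studies; the band edges, filter orders, and envelope cutoff below are assumptions), a channel vocoder of this kind can be sketched as follows:

```python
import numpy as np
from scipy.signal import butter, lfilter

def noise_vocode(x, fs, n_channels=6, lo=100.0, hi=4000.0):
    """Noise-excited channel vocoder: split the input into broad bands, extract
    each band's amplitude envelope, and use it to modulate noise filtered into
    the same band. Band edges, filter orders, and the 30-Hz envelope cutoff are
    illustrative assumptions, not values from the cited studies."""
    edges = np.geomspace(lo, hi, n_channels + 1)        # log-spaced analysis bands
    env_b, env_a = butter(2, 30.0 / (fs / 2), btype="low")
    out = np.zeros(len(x), dtype=float)
    for i in range(n_channels):
        wn = [edges[i] / (fs / 2), edges[i + 1] / (fs / 2)]
        band_b, band_a = butter(3, wn, btype="band")
        x_band = lfilter(band_b, band_a, x)
        env = np.maximum(lfilter(env_b, env_a, np.abs(x_band)), 0.0)
        noise = lfilter(band_b, band_a, np.random.randn(len(x)))
        out += env * noise                               # fine structure replaced by noise
    return out
```

The key property, as noted above, is that within-band amplitude envelopes survive while temporal fine structure does not.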
The rapidity of adaptation to vocoded speech probably
reflects the fact that, while fine spectral detail is lost, the over-
all shape and position of the spectral envelope are well pre-
served. For cochlear implant users, the representation
of speech spectral information is subject to more complex
transformations than those involved in straightforward vocod-
ing. For example, post-lingually deafened cochlear implant
users must adapt to some change of frequency to place map-
ping, since it is highly unlikely that all of the electrode con-
tacts will be at tonotopically correct places. Typically, an
overall upward spectral shift will arise due to incomplete
insertion of the electrode array (Ketten et al.,1998). Short-
term studies using noise-excited vocoding in normal hearing
listeners have shown that such shifts in spectral envelope
have a highly detrimental effect on speech perception, far
beyond that imposed by vocoding per se, and largely inde-
pendent of the degree of spectral resolution (Dorman et al.,
1997b; Shannon et al., 1998). However, a few hours of train-
ing has been shown to be sufficient to lead to significant
improvements in sentence recognition for speech that has
been both noise-vocoded and spectrally-shifted (Faulkner
et al., 2006; Fu and Galvin, 2003; Rosen et al.,1999).
Similarly, Smith and Faulkner (2006) showed that adaptation
was possible to noise-vocoded speech in which the fre-
quency-to-place map was “warped.” This simulated a situa-
tion in which there is a “dead” cochlear region with no
functional neurons, and the frequency map is adjusted so as
to distribute spectral information from the whole signal over
the functioning cochlear regions on either side of the dead
region.
The addition of spectral shifting or warping results in
more complex transformations of speech spectral informa-
tion than noise- or tone-vocoding alone. However, these are
still monotonic transformations that largely preserve the rel-
ative shape of the spectral envelope. Substantially greater
difficulties in adaptation might be anticipated for distortions
of speech signals that involve non-monotonic transforma-
tions of the spectral envelope. Some form of non-
monotonicity might occur in cochlear implant users due to a
range of factors that affect the extent to which current deliv-
ered to a particular electrode is effective in stimulating an
appropriate population of auditory nerve fibers, such as the
possibility of cross-turn stimulation (e.g., Finley et al.,
2008). Such distortions are even more likely in users of audi-
tory brainstem implants (ABIs), where assignment of audi-
tory filters to electrodes relies on clinical pitch ranking
procedures. The difficulties inherent in such procedures
make it more difficult to establish a tonotopic pattern of elec-
trical stimulation (Colletti et al., 2012).
An extreme example of a non-monotonic spectral trans-
formation is spectral rotation, in which the bandwidth of the
signal is first restricted by low-pass filtering and the spec-
trum is then inverted around the center frequency (Blesser,
1969, 1972; Azadpour and Balaban, 2008). Some speech
features are more or less unaffected by spectral rotation,
including amplitude envelope, the presence or absence of pe-
riodicity, and pitch variation that conveys intonation.
However, the inversion of spectral shape and dynamics
makes rotated speech completely unintelligible for naive lis-
teners. That spectrally-rotated speech retains many of the
acoustical properties of actual speech while being unintelli-
gible has led to its widespread use as a non-speech control in
neuroimaging studies of speech perception (e.g., Scott et al.,
2000).
However, it appears that considerable adaptation to this
extreme transformation is possible over a fairly short time.
Blesser (1969, 1972) provided experience with spectral rota-
tion in several 30-min sessions. Pairs of participants who
were well-known to each other heard each other’s speech
only in spectrally-rotated form and communicated purely by
auditory means, using any approach that they found practi-
cal. A range of speech tests, including vowel and consonant
perception and recognition of single words and sentences,
were carried out at various times during the course of the
experiment.
Learning was observed both for the identification of
vowels and consonants and for comprehension of whole sen-
tences, with sentence scores reaching 35% syllables correct
on average after up to 10 h experience. A few points should
be noted here, however. First, in contrast to typical contem-
porary practice, scoring was based not just on key words but
on all syllables within a sentence. Second, participants were
tested with the same sentence list on repeated occasions.
Finally, there was large variability across participants, which
may in part reflect differences in the approaches to learning
adopted by different pairs of participants. Blesser speculated
that there may be a particularly important role for intonation
in the comprehension of spectrally-rotated speech, although
this was not tested directly. Spectral rotation of voiced
speech destroys the original harmonic spectral structure
since the fundamental frequency and all its harmonics are
transposed to different frequencies. However, while the new
spectral components are no longer integer multiples of a
common fundamental frequency (F0), the spacing between
them remains equal to the original F0. This gives rise to rela-
tively strong pitch percepts which rise and fall in the same
pattern as the pitch of the original speech (Plomp, 1967).
Thus, the shape of intonation contours, and the prosodic in-
formation that they convey, are well preserved after spectral
rotation and may contribute to adaptation. This contrasts
with noise-excited vocoded speech in which voice pitch in-
formation is severely degraded or non-existent, depending
upon the details of the processing (Green et al., 2002; Souza
and Rosen, 2009).
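As a simple numerical illustration of this point (ours, not the paper's): with rotation about 2 kHz implemented by modulation with a 4-kHz carrier (see Sec. II C), a component at f Hz moves to 4000 − f Hz, so harmonics lose their common fundamental but keep their spacing:

```python
# Illustration only: rotation about 2 kHz maps a component at f Hz to (4000 - f) Hz.
F0 = 150.0                                  # hypothetical fundamental frequency, Hz
harmonics = [n * F0 for n in range(1, 6)]   # 150, 300, 450, 600, 750 Hz
rotated = [4000.0 - f for f in harmonics]   # 3850, 3700, 3550, 3400, 3250 Hz

print(rotated)            # no longer integer multiples of a common fundamental...
print([rotated[i] - rotated[i + 1] for i in range(len(rotated) - 1)])
# ...but adjacent components remain 150 Hz apart, supporting a pitch percept
```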
Blesser’s work would appear to suggest that a substan-
tial degree of adaptation to an extreme distortion of spectral
shape and dynamics is possible. This would represent a strik-
ing example of plasticity in the perception of a fundamental
acoustic property essential for speech understanding. Such
plasticity may be of relevance to some users of auditory
prostheses, who may experience severe spectral transforma-
tions, albeit not as extreme as spectral rotation. It may also
have implications in relation to the use of spectral rotation as
a non-speech control in imaging studies. However, the
uncontrolled nature of Blesser’s procedures makes it difficult
to be sure of the underlying processes. For example, it is not
clear to what extent improvements in sentence recognition
over the course of Blesser’s experiment were attributable
specifically to adaptation to altered spectral dynamics, rather
than learning to make use of relatively well preserved fea-
tures of transformed speech, or to increasing familiarity with
test materials and procedures. The latter point is important
since, using spectrally-shifted, noise-vocoded speech,
Stacey and Summerfield (2008) showed that substantial improve-
ments in sentence recognition and phoneme identification
occurred due to repeated testing without any intervening
training.
Here, we investigate adaptation to spectrally-rotated
speech in more controlled conditions than those used by
Blesser (1969, 1972). We employed Connected Discourse
Tracking (CDT) (De Filippo and Scott, 1978), a training
method previously found effective for spectrally-shifted
speech (Rosen et al., 1999). Regular tests of phoneme and
sentence recognition probed the course of learning and test
conditions were included in which stimuli were manipulated
so as to assess the contribution of features largely unaffected
by spectral rotation, such as the shape of intonation contours
and the presence or absence of periodicity. Based on
Blesser’s findings it is hypothesized that participants who
receive training will show significantly larger improvements
in perception of spectrally-rotated consonants, vowels, and
sentences than participants who receive no training. In addi-
tion, if learning does not involve adaptation to altered spec-
tral shape and dynamics, but merely reflects enhanced use of
speech features preserved by spectral rotation, benefits of
training would still be expected for stimuli in which ampli-
tude envelope, pitch, and periodicity information is pre-
served, while spectral variation is eliminated. For the
particular feature of intonation, conversely, performance for
the trained group would be significantly reduced for stimuli
processed so as to eliminate intonation cues.
II. METHODS
A. Participants
Eight normal hearing adults participated after giving
informed consent. All were aged between 20 and 30 and 5
were male. Four of the participants (T1–T4) received train-
ing with spectrally-rotated speech, while the remaining four
did not. One of the untrained participants was bilingual in
English and Gujarati, while the remainder had English as
their sole native language.
B. Speech tests
1. Consonants
Recordings of 20 consonants [m n w r l j b p d t g k tʃ ʃ dʒ s f ð z v], in VCV format, spoken by 1 male and 1 female
speaker of Southern Standard British English were available.
Three different vowel contexts were used (/i/, /u/, and /A/),
resulting in a total of 120 stimuli. Participants responded
using a mouse to click on 1 of 20 orthographically-labeled
buttons displayed on a computer screen.
2. Vowels
Recordings of 17 vowels in /bVd/ context, spoken by
the same male and female speakers of Southern Standard
British English were available. There were ten mono-
phthongs [/æ/ (bad), /ɑː/ (bard), /iː/ (bead), /ɛ/ (bed), /ɪ/ (bid), /ɜː/ (bird), /ɒ/ (bod), /ɔː/ (board), /uː/ (booed), /ʌ/ (bud)] and seven diphthongs [/eə/ (bared), /eɪ/ (bayed), /ɪə/ (beard), /aɪ/ (bide), /əʊ/ (bode), /aʊ/ (boughed), /ɔɪ/ (Boyd)].
Two tokens from each speaker for each vowel were used,
giving a total of 68 stimuli. Participants responded by click-
ing on 1 of 17 on-screen buttons orthographically labeled
with the full words.
3. Sentences
Two sets of sentence materials were used. One com-
prised video recordings of Bamford-Kowal-Bench (BKB)
sentences (Bench et al., 1979), read by a female speaker of
Southern Standard British English. In some conditions these
were presented audiovisually, while in others only the sound
was presented. The other consisted of audio-only recordings
of the Adaptive Sentence List (ASL) sentences (MacLeod
and Summerfield, 1990) read by a male speaker of Southern
Standard British English. Like BKB sentences, the ASL
materials are short, highly predictable sentences, e.g., “The
bag was very heavy.” The speakers were different from those
who produced the consonant and vowel materials. Twenty-
one BKB lists, each containing 16 sentences and 50 key-
words, and 18 ASL lists, each containing 15 sentences and
45 key words, were used. After a sentence was presented,
participants repeated whatever words they thought they had
heard. The experimenter then recorded the number of key
words correct, applying a loose scoring method in which a
response was scored as correct if its root matched that of the
key word.
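A minimal sketch of this kind of loose, root-based key-word scoring (the paper does not specify a matching rule, so the suffix-stripping below is purely illustrative; in the study the experimenter scored responses by ear):

```python
def loose_score(response_words, key_words):
    """Count key words judged correct because their root matches a response word.

    The suffix-stripping "root" rule below is a crude stand-in; in the study the
    experimenter applied loose scoring rather than an algorithm.
    """
    def root(word):
        w = word.lower()
        for suffix in ("ing", "ed", "es", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[: -len(suffix)]
        return w

    response_roots = {root(w) for w in response_words}
    return sum(1 for k in key_words if root(k) in response_roots)

# e.g. "bags" would be accepted for the key word "bag"
print(loose_score(["the", "bags", "were", "heavy"], ["bag", "very", "heavy"]))  # -> 2
```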
C. Signal processing and equipment
Participants listened via Sennheiser headphones (HD 25
SP). Spectral rotation was performed in real time using
the software system Aladdin (Hitech AB, Sweden) and a
digital-signal-processing PC card (Loughborough Sound
Images TMS320C31) running at a sampling rate of
22.05 kHz. Input speech was first low-pass filtered at 4 kHz
using a tenth-order elliptic filter. Additional filtering (33-
point finite impulse response) was applied in order to mini-
mize differences in the long-term spectra of rotated and nor-
mal speech. The design of this additional filter was based
largely on published measurements of the long-term average
speech spectrum (Byrne et al., 1994), although the roll-off
below 120 Hz was ignored, and a flat spectrum below
420 Hz assumed. Spectral rotation around 2 kHz was then
implemented via modulation with a 4-kHz sinusoid. In order
to remove upper frequency side bands, the modulated signal
was low-pass filtered again with the same elliptic filter. The
total root-mean-square level of the spectrally-rotated signal
was set equal to that of the original low-pass filtered signal.
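The processing chain described in this section can be approximated offline roughly as follows (our reconstruction, not the original real-time Aladdin/DSP implementation; the 33-point spectrum-compensation FIR is omitted, and the elliptic-filter ripple and attenuation values are assumptions):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import ellip, lfilter

def spectrally_rotate(x, fs=22050.0):
    """Offline approximation of the rotation described above: low-pass at 4 kHz,
    multiply by a 4-kHz sinusoid (mirroring the spectrum about 2 kHz), low-pass
    again to remove the upper sideband, then match RMS to the low-passed input.
    The 33-point FIR that equates long-term spectra of rotated and normal speech
    is omitted, and the elliptic ripple/attenuation values are assumed."""
    b, a = ellip(10, 0.5, 60, 4000.0 / (fs / 2), btype="low")   # tenth-order elliptic
    x_lp = lfilter(b, a, x)

    t = np.arange(len(x_lp)) / fs
    x_mod = x_lp * np.sin(2 * np.pi * 4000.0 * t)   # component at f -> 4000 - f (and 4000 + f)

    y = lfilter(b, a, x_mod)                        # remove the upper sideband
    y *= np.sqrt(np.mean(x_lp ** 2) / (np.mean(y ** 2) + 1e-12))
    return y

if __name__ == "__main__":
    fs, x = wavfile.read("input.wav")               # hypothetical input file
    x = x.astype(float)
    if x.ndim > 1:
        x = x.mean(axis=1)                          # fold any stereo input to mono
    y = spectrally_rotate(x, fs)
    wavfile.write("rotated.wav", int(fs),
                  np.int16(y / (np.max(np.abs(y)) + 1e-12) * 0.9 * 32767))
```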
D. Additional manipulations probing the role
of speech features unaffected by spectral rotation
In order to assess the contribution of intonation to the
comprehension of spectrally-rotated speech, sentence tests
were included in which, prior to spectral rotation, stimuli
were manipulated so as to eliminate intonation information.
The Pitch Synchronous Overlap Add technique (Moulines
and Charpentier, 1990), as implemented in Praat (Boersma
and Weenink, 2001) was used to replace the natural pitch
contours of the BKB and ASL sentences with a monotone at
230 Hz for the female talker and at 150 Hz for the male
talker. As a control for possible artifacts introduced by Praat,
a further condition was included in which the natural pitch
contours were shifted, up by 3.5 semitones for the male
talker and down by 3.5 semitones for the female talker.
These conditions will subsequently be referred to respec-
tively as “Monotone” and “Shifted.”
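A sketch of the Monotone manipulation using Praat's Manipulation/overlap-add workflow via the parselmouth Python wrapper (an assumed toolchain; the study used Praat's PSOLA implementation directly, and the analysis settings below are not taken from the paper):

```python
import parselmouth
from parselmouth.praat import call

def monotonize(wav_path, target_f0, out_path):
    """Flatten the pitch contour to a monotone at target_f0 (Hz) and resynthesize
    with overlap-add (PSOLA-style), as done prior to spectral rotation."""
    snd = parselmouth.Sound(wav_path)
    dur = call(snd, "Get total duration")
    manipulation = call(snd, "To Manipulation", 0.01, 75, 600)   # analysis settings assumed
    pitch_tier = call(manipulation, "Extract pitch tier")
    call(pitch_tier, "Remove points between", 0, dur)            # discard natural contour
    call(pitch_tier, "Add point", dur / 2, target_f0)            # single flat pitch target
    call([pitch_tier, manipulation], "Replace pitch tier")
    out = call(manipulation, "Get resynthesis (overlap-add)")
    call(out, "Save as WAV file", out_path)

# 230 Hz for the female talker, 150 Hz for the male talker (hypothetical file names)
monotonize("bkb_sentence.wav", 230.0, "bkb_sentence_monotone.wav")
```

The Shifted control could be produced the same way by scaling, rather than replacing, the extracted pitch points.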
An alternative approach to assessing the relative contri-
butions of spectral dynamics and features unaffected by spec-
tral rotation involved eliminating spectral variation while
preserving amplitude envelope and pitch and periodicity in-
formation. This manipulation, previously used by Faulkner
and Rosen (1999), was implemented in real time in Aladdin
by the use of a second input signal comprising pulses occur-
ring once per pitch period. This pulse input triggered the gen-
eration of a pulse train carrier within the DSP system which
was then modulated by an amplitude envelope extracted from
the speech signal (bandpass filtered between 50 Hz and
3 kHz, 6-dB per octave roll-off). Envelope extraction
employed full-wave rectification and a 32-Hz low-pass filter
(fourth-order elliptic). During unvoiced speech segments a
J. Acoust. Soc. Am., Vol. 134, No. 2, August 2013 Green et al.: Adaptation to spectrally-rotated speech 1371
Downloaded 03 Aug 2013 to 92.232.224.49. Redistribution subject to ASA license or copyright; see http://asadl.org/terms
white-noise carrier was used. Mixed excitation sounds (e.g.,
/z/) led to a voiced output alone. Finally, spectral rotation
was applied as described above. This condition will subse-
quently be referred to as “PP” (for “Pitch and Periodicity”).
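A rough sketch of the PP processing prior to rotation (ours; per-sample F0 and voicing decisions are assumed to be supplied by the caller, whereas the original system took a separate pitch-synchronous pulse input, and filter details not stated in the text are assumed):

```python
import numpy as np
from scipy.signal import butter, ellip, lfilter

def pp_stimulus(x, f0_track, voiced, fs=22050.0):
    """Sketch of the PP condition before rotation: the amplitude envelope of the
    band-limited speech modulates a pulse-train carrier during voicing and a
    white-noise carrier elsewhere. f0_track is per-sample F0 in Hz and voiced is
    a per-sample boolean array; both are assumed inputs here."""
    # Band-limit roughly 50 Hz-3 kHz with a shallow (~6 dB/octave) filter
    bp_b, bp_a = butter(1, [50.0 / (fs / 2), 3000.0 / (fs / 2)], btype="band")
    x_bp = lfilter(bp_b, bp_a, x)

    # Envelope: full-wave rectification then a 32-Hz fourth-order elliptic low-pass
    # (ripple/attenuation values assumed)
    env_b, env_a = ellip(4, 0.5, 50, 32.0 / (fs / 2), btype="low")
    env = np.maximum(lfilter(env_b, env_a, np.abs(x_bp)), 0.0)

    # Pulse-train carrier: one unit impulse per pitch period while voiced
    pulses = np.zeros(len(x))
    phase = 0.0
    for n in range(len(x)):
        phase += f0_track[n] / fs
        if voiced[n] and phase >= 1.0:
            pulses[n] = 1.0
            phase -= 1.0

    noise = 0.05 * np.random.randn(len(x))          # carrier for unvoiced segments
    carrier = np.where(voiced, pulses, noise)
    return carrier * env                            # spectral rotation is then applied
```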
A further condition, tested only in two of the trained par-
ticipants (see Sec. II F below), assessed the possibility that
improvements in sentence recognition with training primar-
ily reflected an enhanced ability to take advantage of infor-
mation from the narrow band of frequencies centered on the
frequency around which the spectrum was rotated, which are
relatively unaffected by the transformation. Prior to spectral
rotation, sentences were subject to filtering with steep cutoffs
(Chebyshev type II) to restrict the signal to a single band
centered at 2 kHz. The bandwidth was 240 Hz, correspond-
ing to the equivalent rectangular bandwidth of the auditory
filter (Moore and Glasberg, 1983).
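A sketch of this single-band filtering (the filter order and stopband attenuation are assumptions; the bandwidth follows the Moore and Glasberg, 1983, ERB formula, which gives approximately 240 Hz at 2 kHz):

```python
import numpy as np
from scipy.signal import cheby2, lfilter

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth after Moore and Glasberg (1983), with
    centre frequency in kHz: ERB = 6.23 f^2 + 93.39 f + 28.52 (~240 Hz at 2 kHz)."""
    f = f_hz / 1000.0
    return 6.23 * f ** 2 + 93.39 * f + 28.52

def single_band_2khz(x, fs):
    """Restrict the signal to one auditory-filter-wide band centred at 2 kHz using
    a steep Chebyshev type II band-pass (order and attenuation assumed)."""
    bw = erb_hz(2000.0)
    lo, hi = 2000.0 - bw / 2, 2000.0 + bw / 2
    b, a = cheby2(8, 60, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return lfilter(b, a, x)
```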
E. Training
CDT was implemented in a similar way to Rosen et al.
(1999). A single female speaker of Southern British English
(one of the authors, R. Paterson) was the talker for all four
trained participants. The materials used were books drawn
from the Heinemann Guided Readers series aimed at learn-
ers of English. The trainer and the participant were seated in
adjacent sound-proof rooms separated by a double-pane
glass partition. The trainer read phrases from the text which
the participant was required to repeat verbatim. Following
an accurate response, the trainer moved on to the next
phrase. If an error was made, the trainer repeated all or part
of the phrase, and the participant responded again. If the
phrase had not been accurately repeated after three attempts,
it was presented unprocessed before moving on to the next
phrase. Performance on the task was assessed in terms of the
rate, in words/min, at which the participant was able to cor-
rectly repeat the phrases spoken by the trainer. A low-level
pink noise was present in the participant’s room to mask any
of the trainer’s natural speech not sufficiently attenuated by
the intervening wall. Approximately half the training was
carried out with the participant able to see the speaker’s
face, while for the remainder, the glass partition between the
two rooms was covered and the participant received only
audio input. The addition of visual input was intended to
expedite learning, particularly in the early stages of training
when performance with the audio signal alone was expected
to be very poor.
F. Procedure
Trained participants completed four sessions of speech
perception testing. The first three sessions took place at
approximately weekly intervals. There was a shorter gap,
typically 3 to 4 days, between the third and fourth sessions.
They underwent a total of 6 h of CDT with spectrally-rotated
speech: 3 h between the first and second testing sessions, and
3 h between the second and third testing sessions. No train-
ing took place between the third and fourth testing sessions.
Training was completed in eight 45-min sessions, divided
into 5-min blocks which alternated between audiovisual and
audio-only presentation, with the exception that the first four
blocks in the initial training session all used audiovisual pre-
sentation. Untrained participants completed the same tests as
the trained group, but over a shorter period of time, typically
within a few days.
The first three testing sessions contained tests of the per-
ception of spectrally-rotated speech, using all the different
speech materials. In each session the 120 VCV stimuli were
presented once and the 68 vowel stimuli twice. Four BKB
sentence lists (female talker) were presented, two audiovisu-
ally and two audio-only. Two ASL sentence lists (male
talker) were presented audio-only. In addition, in the first
testing session vowel and consonant perception were
assessed with stimuli low-pass filtered at 4 kHz, but other-
wise unprocessed. No feedback was given during any of the
testing. The order in which the four different types of tests
were conducted by each participant was based on random-
ized Latin Squares. Sentence lists were chosen at random
(without replacement) for each participant.
The fourth testing session examined speech recognition
in the conditions with additional manipulations intended to
elucidate the factors underlying learning of spectrally-
rotated speech. Sentence recognition was assessed audio-
only in Shifted, Monotone, and PP conditions (one BKB and
two ASL lists in each condition).
In addition, two of the four trained participants (T1 and
T4) carried out additional testing approximately 7 weeks af-
ter the fourth testing session. Sentence recognition was
tested both for spectrally-rotated speech as previously expe-
rienced, to assess retention of learning, and for speech fil-
tered into a single spectral band around 2 kHz. Prior to
testing they were given a brief “reminder” CDT session
comprising one 5-min block of audiovisual training followed
by four 5-min blocks of audio-only training.
III. RESULTS
A. Connected discourse tracking
Figure 1 shows how performance in CDT training
changed over the eight 45-min training sessions. As would
be expected, performance was better in the audiovisual con-
dition than with audio-only presentation, reflecting the avail-
ability of speech-reading cues. In the audiovisual condition,
performance reached a plateau around the fourth training
session at approximately 80 words/min. This is some way
short of the maximum rate previously found with normal
speech and full visual and acoustic cues of between 110 and
130 words/min, depending on the complexity of the training
test (De Filippo and Scott, 1978; Hochberg et al., 1989).
This suggests that although tracking rates ceased to improve
over the final few training sessions, adaptation was not com-
plete. In the audio-only condition performance continued to
improve until the final one or two sessions. The lack of
improvement in the final session may partly reflect the fact
that, during this session, three of the four participants
reached the end of the book that had been used since the start
of training and therefore had to adjust to new material.
Linear regressions on words correct per minute against
training session in audio-only conditions showed that there
was a highly significant increase in performance across
session (p < 0.001 for all four participants). For three of the
four participants the regression slopes were similar, corre-
sponding to an increase of between 15 and 20 words/min for
each hour of training. The remaining participant (T3) learned
considerably more slowly with a regression slope showing
an increase of only 4 words/min for every hour of training.
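For concreteness, a regression of this kind, including the conversion from slope per 45-min session to words/min gained per hour of training, might look as follows (the tracking rates below are invented for illustration only):

```python
import numpy as np
from scipy.stats import linregress

# Invented example values: audio-only tracking rate (words/min) in each of the
# eight 45-min training sessions for one hypothetical participant
sessions = np.arange(1, 9)
wpm = np.array([5.0, 14.0, 25.0, 33.0, 44.0, 52.0, 63.0, 70.0])

fit = linregress(sessions, wpm)
# Each session contributes 0.75 h of training, so the gain per hour of training is:
print(fit.slope / 0.75, fit.pvalue)
```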
B. Consonants
Consonant identification data is shown in Fig. 2.
Performance was near ceiling with stimuli which were
unprocessed beyond low-pass filtering at 4 kHz. For
spectrally-rotated stimuli, performance was very low in all
three testing sessions for both groups with both talkers, with
no evidence of learning for the trained group. Here, and sub-
sequently, data were analyzed using mixed effects linear
modeling. This enables the amount of training to be treated
as a continuous variable and also allows incorporation of
scores from multiple lists in the same condition. Data from
the spectrally-rotated stimuli were analyzed with factors of
talker (male or female), group (trained or control), and test
session number. No significant main effects or interactions
were observed {main effect of talker [F(1,40) = 1.64, p = 0.208], interaction between talker and session number [F(1,40) = 1.17, p = 0.285], all other F's < 1}.
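The paper does not state which software was used; an analogous mixed effects analysis, with session treated as a continuous predictor and a random intercept per participant, could be sketched in Python as:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per test list/run, with columns
# 'score', 'talker', 'group', 'session' (numeric, so the amount of training is
# a continuous predictor) and 'participant'
df = pd.read_csv("rotated_scores.csv")        # assumed file name and layout

# Random intercept per participant; fixed effects of talker, group, session and
# their interactions, analogous to the analyses reported here
model = smf.mixedlm("score ~ talker * group * session", df, groups=df["participant"])
print(model.fit().summary())
```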
C. Vowels
As shown in Fig. 3, vowel identification data showed a
similar pattern to consonant identification: Near-ceiling per-
formance on stimuli that were merely low-pass filtered at
4 kHz, very low performance in all tests with spectrally-
rotated stimuli, and no evidence of learning. An analysis
using a mixed effects linear model with factors of group,
talker, and session number showed no significant main
effects or interactions {main effect of group [F(1,88) = 1.07, p = 0.303], all other F's < 1}.
D. Sentences
1. Spectral rotation only
Recognition of spectrally-rotated sentences with audio-
only presentation over the first three test sessions will be
examined first (Fig. 4). In contrast to the vowel and conso-
nant data, there was clear evidence of learning. In the first
testing session performance was very low for both trained
and untrained groups. In subsequent testing sessions, how-
ever, while performance remained poor for the untrained
group, it steadily increased for the trained group. Increased
sentence recognition with training was apparent with both
test talkers, but was particularly pronounced for the female
talker.
All four trained participants showed improvements over
the different test sessions although there was considerable
variability in the extent of improvement. Consistent with the
CDT data the least improvement was shown by T3, for
whom the mean proportion of key words correct for the female talker increased from 0.01 to 0.11 over the three test sessions. For the other three participants the increase in proportion correct for the female talker ranged between 0.26 and 0.64. Benefits from training were somewhat more consistent for the male talker, with increases in proportion correct for all four listeners ranging between 0.09 and 0.22.

FIG. 1. Boxplots of performance over the eight training sessions of CDT with spectrally-rotated speech. Mean words correct per minute, averaged across the 5-min segments within each training session, are shown for both audiovisual and audio-only presentation.

FIG. 2. Boxplots of consonant identification for trained and control groups for the male and female talkers (top and bottom panels, respectively). The two leftmost boxes show performance with low-pass filtered (LP) but otherwise unprocessed materials obtained in the first testing session.
Audio-only data from the first three test sessions were
analyzed with a mixed effects linear model with factors of
talker, session, and group. The three-way interaction was
close to significance [F(1,87) = 3.17, p = 0.078], as was the two-way interaction between talker and session [F(1,87) = 3.42, p = 0.068]. The two-way interaction between talker and group was not significant [F < 1]. Most importantly, there was a highly significant two-way interaction between session and group [F(1,87) = 19.67, p < 0.001], reflecting the fact that performance did not change for the untrained group, but increased over time for the trained group. The main effect of talker was not significant [F(1,87) = 2.69, p = 0.105]. The significant interaction between group and session means that analysis of their main effects is of little consequence, but for completeness we observe that there was no significant effect of group [F(1,87) = 2.92, p = 0.091], but a highly significant effect of session [F(1,87) = 23.93, p < 0.001].
With audiovisual presentation (female talker only) there
was also evidence of learning, although there was very large
variability in performance levels before training and ceiling
effects occurred for the trained group. Despite these limita-
tions, a mixed effects linear analysis showed evidence of
learning in the form of a significant interaction between ses-
sion and group [F(1,44) = 5.65, p = 0.022]. Data from audio-
visual conditions will not be considered further.
2. Spectral rotation with additional signal
manipulations
As shown in Fig. 4, performance was also better for the
trained than the untrained group in the Monotone and
Shifted conditions. For the male talker there was little differ-
ence between performance in these conditions and that in the
third test session with spectrally-rotated speech, conducted
after 6 h of training. For the female talker the Monotone con-
dition did produce poorer performance for the trained group
than that in the third test session with spectrally-rotated
speech. However, for three of the four participants a very
similar decrement in performance was also apparent in the
Shifted condition. Since the intonation information con-
tained in the Shifted condition is very similar to that in the
speech subjected only to spectral rotation, it is likely that the
decrement in the Monotone condition was primarily attribut-
able to artifacts of the voice pitch manipulation process,
rather than to the absence of intonation information.
Data from the Monotone and Shifted conditions and
from the third test session with spectrally-rotated speech
were submitted to a mixed effects linear analysis with
factors of group, talker, and condition. There was a highly
significant effect of group [F(1,68) = 37.09, p < 0.001], but no other main effects or interactions were significant {three-way interaction [F(1,68) = 1.56, p = 0.218], interaction between group and talker [F(1,68) = 2.40, p = 0.126], interaction between group and condition [F(1,68) = 1.17, p = 0.318], all other F's < 1}.

FIG. 3. Boxplots of vowel identification for trained and control groups for the male and female talkers (top and bottom panels, respectively). The two leftmost boxes show performance with low-pass filtered (LP) but otherwise unprocessed materials obtained in the first testing session. The remaining boxes show performance with spectrally-rotated materials obtained over the first three testing sessions.

FIG. 4. Boxplots of audio-only sentence recognition for the male and female talkers. Each panel shows (from left to right) performance with spectrally-rotated sentences in the first three testing sessions, and with pitch-shifted and monotone sentences from the fourth testing session.
The elimination of spectral variation implemented in the
PP condition led to floor effects for both trained and
untrained groups. Across the two groups, 21 out of a total of
24 runs produced zero key words correct.
Figure 5 shows audio-only sentence recognition for the
two trained participants who were available for further test-
ing approximately 7 weeks after completing the main experi-
ment. Their performance in the third testing session is shown
for comparison and it is clear that there was no decrement in
performance over the intervening period. Also shown in Fig.
5 is that performance was at floor when signals were re-
stricted to a single spectral band centered at 2 kHz.
IV. DISCUSSION
Audio-only sentence recognition improved for the
trained group but not for the untrained group. This confirms
that adaptation to spectrally-rotated speech is possible and
shows that the improvement was not attributable to repeated
exposure to the test procedures. Improved sentence recogni-
tion after training was observed for two different talkers,
both of whom were different to the training talker, indicating
generalization of learning across talkers. Recognition of
spectrally-rotated sentences in two subjects available for
follow-up testing did not differ from that obtained immedi-
ately after training, suggesting that learning was robust over
a period of several weeks. Data obtained with additional
stimulus manipulations, discussed in more detail below,
were consistent with the idea that improvements with
training did not simply reflect better use of those, primarily
temporal, speech features that are relatively well preserved
after spectral rotation. These findings demonstrate that quite
rapid adaptation is possible to a radical transformation of the
representation of critical speech spectral information.
Spectral rotation is clearly not directly comparable to the
transformations that might be experienced by users of audi-
tory prostheses. However, this does suggest that non-
monotonic transformations of spectral information, such as
might be experienced particularly by ABI users, do not, per
se, preclude the regaining of substantial levels of speech
recognition.
It was conceivable that improvements after training
might involve participants learning to make better use of
speech information that was relatively unaffected by the
transformation, such as periodicity and envelope, while
ignoring distorted spectral information. However, this was
not supported by testing with stimulus manipulations that,
prior to spectral rotation, selectively eliminated or preserved
different aspects of speech information. Sentence recogni-
tion for the two re-tested participants was at floor for stimuli
restricted to the central frequency band around 2 kHz that
was largely unaffected by spectral rotation, showing that
they had not learned simply to ignore information from
transformed spectral regions. Performance was similarly
poor for stimuli in which spectral variation was eliminated,
while pitch, periodicity, and amplitude envelope were pre-
served. This suggests that post-training improvements in per-
ception of spectrally-rotated speech did reflect adaptation to
altered spectral information and is consistent with the notion
that access to spectral dynamics is critical to speech under-
standing (Rosen and Iverson, 2007). In addition, sentence
recognition was not significantly poorer when natural F0
contours were replaced with monotones. Therefore, the pres-
ervation of intonation contour shape does not appear to be
critical to the comprehension of spectrally-rotated speech af-
ter training. It remains possible, however, that the presence
of near natural intonation patterns may be important during
the learning process through, for example, providing helpful
cues to segmentation and syntactic structure.
Somewhat surprisingly, despite the improvements in
CDT rates and sentence recognition with training, there were
no significant improvements in medial vowel or intervocalic
consonant identification, suggesting a particularly important
role for contextual information (Boothroyd and Nittrouer,
1988). One possibility is that the multiple constraints on
word choices provided by simple, predictable sentences
make these materials sensitive to improvements in the per-
ception of speech features that are too small to be manifested
in tests of isolated phoneme identification. However, it
should be noted that this aspect of our data contrasts with
Blesser (1969, 1972), where there were improvements in
both vowel and consonant recognition. Methodological fac-
tors might be important here. For example, Blesser’s partici-
pants completed tests of vowel and consonant discrimination
prior to identification tests. It should, however, also be noted
that the improvements in vowel or consonant identification
observed by Blesser were quite small and that there was little
correlation with improvements in sentence recognition.
FIG. 5. Boxplots of audio-only sentence recognition for the male and female
talkers from the additional testing carried out with two trained participants,
7 weeks after the end of the initial testing. This included simple spectral
rotation and also conditions in which stimuli were restricted to a single fre-
quency band centered at 2 kHz. For comparison the performance of these
two participants with spectrally-rotated sentences in the third testing session
is shown on the left-hand side of the figure.
Studies of the learning of spectrally-shifted noise- or
tone-vocoded speech have also produced an inconsistent pic-
ture regarding the relationship between sentence recognition
and phoneme identification, measured using VCV and bVd
or hVd materials. Using CDT training Rosen et al. (1999)
found large improvements in recognition of BKB sentences
and also significant improvements in vowel and consonant
identification. Fu et al. (2005) implemented an interactive,
computer-based training method using unconnected HINT
sentences (Nilsson et al., 1994) and found significant
improvements for consonant, but not vowel, identification.
Unfortunately, Fu et al. (2005) did not include sentence tests,
but large increases in sentence recognition during training
were reported. Using a similar training method Stacey and
Summerfield (2008) found significant improvements in rec-
ognition of BKB and IEEE sentences, but no significant
effect of training on either vowel or consonant identification.
Sentence recognition, of course, involves a wider range
of cognitive skills operating on lexical, semantic, and syntac-
tic information, which introduces more variability across
participants. It may also be relevant that the two aforemen-
tioned studies that showed significant improvement for both
vowel and consonant identification (Blesser, 1972; Rosen
et al., 1999) used a single talker for each type of test material
whereas the remainder, including the present study, used
multiple talkers within vowel and consonant tests. It may be
that trial-to-trial variation of talkers depresses performance
in tests of brief isolated phonemes which provide very little
time to adjust to a different talker.
The finding that learning generalized across talkers
indicates that adaptation was occurring at a level of abstrac-
tion beyond the particular acoustic-phonetic patterns pro-
duced by the training talker. Generalization to new talkers
would likely be increased by using a number of different
talkers during training (Stacey and Summerfield, 2007).
Further research might explore which aspects of speech
processing are being modified during learning of spectrally-
rotated speech. In the context of vocoded or synthetic
speech, there has been considerable research examining the
transfer of perceptual learning of speech to contexts that
were not experienced in training, providing information
about the levels of processing at which training related
changes occur (e.g., Francis et al., 2007; Dahan and Mead,
2010; Hervais-Adelman et al., 2011). Such work suggests
that while there is more or less complete generalization
across some acoustic features, there are also context-
dependent aspects of learning. For example, Hervais-
Adelman et al. (2011) showed that learning of vocoded
speech transferred to an untrained frequency region, but that
learning only partly generalized across different carrier sig-
nals used in vocoding. Dahan and Mead (2010) found that
after training with noise-vocoded monosyllables, perception
of consonants in test stimuli differed according to whether
they appeared in the same position, or flanked by the same
vowel, as in the training stimuli. There is also evidence that
lexical information plays an important role in learning of
vocoded speech (Stacey and Summerfield, 2008), but that
semantic context is not essential for learning (Davis et al.,
2005; Loebach et al., 2010).
It would appear reasonable to expect a considerable
overlap between the processes involved in adapting to
spectrally-shifted vocoded speech and spectrally-rotated
speech. However, the transformations do differ substantially,
e.g., in the contrasting extent to which information about
intonation and spectral dynamics is preserved, and it remains
possible that there may be aspects of learning that are spe-
cific to spectral rotation. Other unresolved issues include the
extent to which further improvements in speech perception
might be possible with long-term experience with extreme
spectral transformations, and the degree to which short-term
adaptation might occur with passive exposure to spectral
rotation rather than specific training. However, it does
appear that some caution may be necessary in the treatment
of spectrally-rotated speech as a non-speech control in neu-
roimaging research, in particular when relatively long expo-
sure periods are used.
V. CONCLUSIONS
Considerable adaptation to an extreme form of distor-
tion of spectral information was possible with a few hours
experience. Learning did appear to involve adaptation to
altered spectral shape and dynamics. The fact that intona-
tional contrasts are well preserved after the transformation
did not appear to be important for the comprehension of
spectrally-rotated speech after training, though it remains
possible that intonation might contribute to the process of
adaptation.
ACKNOWLEDGMENT
This work was partially supported by Action on Hearing
Loss (Grant No. G53).
Azadpour, M., and Balaban, E. (2008). “Phonological representations are
unconsciously used when processing complex, non-speech signals,” PLoS
ONE 3, e1966.
Bench, J., Kowal, A., and Bamford, J. (1979). “The BKB (Bamford-Kowal-
Bench) sentence lists for partially-hearing children,” Br. J. Audiol. 13,
108–112.
Blesser, B. (1969). “Perception of spectrally rotated speech,” Ph.D. disserta-
tion, Massachusetts Institute of Technology, Cambridge, MA.
Blesser, B. (1972). “Speech perception under conditions of spectral transfor-
mation. 1. Phonetic characteristics,” J. Speech Hear. Res. 15, 5–41.
Boersma, P., and Weenink, D. (2001). “Praat: Doing phonetics by computer
(version 3.9.28) [computer program],” http://www.praat.org (Last viewed
May 2001).
Boothroyd, A., and Nittrouer, S. (1988). “Mathematical treatment of context
effects in phoneme and word recognition,” J. Acoust. Soc. Am. 84,
101–114.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R.,
Hagerman, B., Hetu, R., Kei, J., Lui, C., Kiessling, J., Kotby, M. N.,
Nasser, N. H. A., El Kholy, W. A. H., Nakanishi, Y., Oyer, H., Powell, R.,
Stephens, D., Meredith, R., Sirimanna, T., Tavartkiladze, G., Frolenkov,
G. I., Westerman, S., and Ludvigsen, C. (1994). “An international compar-
ison of long-term average speech spectra,” J. Acoust. Soc. Am. 96,
2108–2120.
Colletti, L., Shannon, R., and Colletti, V. (2012). “Auditory brainstem
implants for neurofibromatosis type 2,” Curr. Opin. Otolaryngol. Head
Neck Surg. 20, 353–357.
Dahan, D., and Mead, R. L. (2010). “Context-conditioned generalization in
adaptation to distorted speech,” J. Exp. Psychol. Hum. Percept. Perform.
36, 704–728.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., and
McGettigan, C. (2005). “Lexical information drives perceptual learning of
distorted speech: Evidence from the comprehension of noise-vocoded
sentences,” J. Exp. Psychol. Gen. 134, 222–241.
De Filippo, C., and Scott, B. L. (1978). “Method for training and evaluating
reception of ongoing speech,” J. Acoust. Soc. Am. 63, 1186–1192.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997a). “Speech intelligibil-
ity as a function of the number of channels of stimulation for signal pro-
cessors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am.
102, 2403–2411.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997b). “Simulating the
effect of cochlear-implant electrode insertion depth on speech under-
standing,” J. Acoust. Soc. Am. 102, 2993–2996.
Dupoux, E., and Green, K. (1997). “Perceptual adjustment to highly com-
pressed speech: Effects of talker and rate changes,” J. Exp. Psychol. Hum.
Percept. Perform. 23, 914–927.
Faulkner, A., and Rosen, S. (1999). “Contributions of temporal encodings of
voicing, voicelessness, fundamental frequency, and amplitude variation to
audio-visual and auditory speech perception,” J. Acoust. Soc. Am. 106,
2063–2073.
Faulkner, A., Rosen, S., and Norman, C. (2006). “The right information may
matter more than frequency-place alignment: Simulations of frequency-
aligned and upward shifting cochlear implant processors for a shallow
electrode array insertion,” Ear Hear. 27, 139–152.
Finley, C. C., Holden, T. A., Holden, L. K., Whiting, B. R., Chole, R. A.,
Neely, G. J., Hullar, T. E., and Skinner, M. W. (2008). “Role of electrode
placement as a contributor to variability in cochlear implant outcomes,”
Otol. Neurotol. 29, 920–928.
Francis, A. L., Nusbaum, H. C., and Fenn, K. (2007). “Effects of training on
the acoustic-phonetic representation of synthetic speech,” J. Speech Lang.
Hear. Res. 50, 1445–1465.
Fu, Q.-J., and Galvin, J. J. (2003). “The effects of short-term training for
spectrally mismatched noise-band speech,” J. Acoust. Soc. Am. 113,
1065–1072.
Fu, Q.-J., Nogaki, G., and Galvin, J. J. (2005). “Auditory training with
spectrally-shifted speech: Implications for cochlear implant patient audi-
tory rehabilitation,” J. Assoc. Res. Otolaryngol. 6, 180–189.
Green, T., Faulkner, A., and Rosen, S. (2002). “Spectral and temporal cues
to pitch in noise-excited vocoder simulations of continuous-interleaved-
sampling cochlear implants,” J. Acoust. Soc. Am. 112, 2155–2164.
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., and Carlyon, R. P.
(2008). “Perceptual learning of noise vocoded words: Effects of feed-
back and lexicality,” J. Exp. Psychol. Hum. Percept. Perform. 34,
460–474.
Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., and
Carlyon, R. P. (2011). “Generalization of perceptual learning of vocoded
speech,” J. Exp. Psychol. Hum. Percept. Perform. 37, 283–295.
Hill, J., McRae, P., and McClellan, R. (1968). “Speech recognition as a
function of channel capacity in a discrete set of channels,” J. Acoust. Soc.
Am. 44, 13–18.
Hochberg, I., Rosen, S., and Ball, V. (1989). “Effect of text complexity on
connected discourse tracking rate,” Ear Hear. 10, 192–199.
Ketten, D. R., Vannier, M. W., Skinner, M. W., Gates, G. A., Wang, G., and
Neely, J. G. (1998). “In vivo measures of cochlear length and insertion
depth of nucleus cochlear implant electrode arrays,” Ann. Otol. Rhin.
Laryngol. 107, 1–16.
Loebach, J. L., Pisoni, D. B., and Svirsky, M. A. (2010). “Effects of seman-
tic context and feedback on perceptual learning of speech processed
through an acoustic simulation of a cochlear implant,” J. Exp. Psychol.
Hum. Percept. Perform. 36, 224–234.
MacLeod, A., and Summerfield, A. Q. (1990). “A procedure for measuring
auditory and audio-visual speech-reception thresholds for sentences in
noise: Rationale, evaluation, and recommendations for use,” Br. J. Audiol.
24, 29–43.
Moore, B. C. J., and Glasberg, B. R. (1983). “Suggested formulae for calcu-
lating auditory filter bandwidths and excitation patterns,” J. Acoust. Soc.
Am. 74, 750–753.
Moulines, E., and Charpentier, F. (1990). “Pitch-synchronous waveform
processing techniques for text-to-speech synthesis using diphones,”
Speech Commun. 9, 453–467.
Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the
hearing in noise test for the measurement of speech reception thresholds in
quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099.
Pallier, C., Sebastián-Gallés, N., Dupoux, E., Christophe, A., and Mehler, J.
(1998). “Perceptual adjustment to time-compressed speech: A cross-
linguistic study,” Mem. Cognit. 26, 844–851.
Plomp, R. (1967). “Pitch of complex tones,” J. Acoust. Soc. Am. 41,
1526–1533.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981). “Speech
perception without traditional speech cues,” Science 212, 947–950.
Rosen, S., Faulkner, A., and Wilkinson, L. (1999). “Adaptation by normal
listeners to upward spectral shifts of speech: Implications for cochlear
implants,” J. Acoust. Soc. Am. 106, 3629–3636.
Rosen, S., and Iverson, P. (2007). “Constructing adequate non-speech ana-
logues: what is special about speech anyway?” Dev. Sci. 10, 165–168.
Scott, S. K., Blank, C. C., Rosen, S., and Wise, R. J. S. (2000).
“Identification of a pathway for intelligible speech in the left temporal
lobe,” Brain 123, 2400–2406.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M.
(1995). “Speech recognition with primarily temporal cues,” Science 270,
303–304.
Shannon, R. V., Zeng, F.-G., and Wygonski, J. (1998). “Speech recognition
with altered spectral distribution of envelope cues,” J. Acoust. Soc. Am.
104, 2467–2476.
Smith, M. W., and Faulkner, A. (2006). “Perceptual adaptation by normally
hearing listeners to a simulated ‘hole’ in hearing,” J. Acoust. Soc. Am.
120, 4019–4030.
Souza, P., and Rosen, S. (2009). “Effects of envelope bandwidth on the
intelligibility of sine- and noise-vocoded speech,” J. Acoust. Soc. Am.
126, 792–805.
Stacey, P. C., and Summerfield, A. Q. (2007). “Effectiveness of computer-
based auditory training in improving the perception of noise-vocoded
speech,” J. Acoust. Soc. Am. 121, 2923–2935.
Stacey, P. C., and Summerfield, A. Q. (2008). “Comparison of word-, sen-
tence-, and phoneme-based training strategies in improving the perception
of spectrally distorted speech,” J. Speech Lang. Hear. Res. 51, 526–538.
J. Acoust. Soc. Am., Vol. 134, No. 2, August 2013 Green et al.: Adaptation to spectrally-rotated speech 1377
Downloaded 03 Aug 2013 to 92.232.224.49. Redistribution subject to ASA license or copyright; see http://asadl.org/terms
... It follows, then, that more experience with a degraded signal should enhance perception of that signal in subsequent encounters. Empirical support for this argument comes from literature documenting perceptual learning of spectrally rotated (Green et al., 2013) and time-compressed (Banai & Lavner, 2016) speech, in which learning outcomes were optimized with more exposure to the degraded speech signal. In addition, listeners familiarized with talkers with spastic and athetoid dysarthria, secondary to cerebral palsy, over multiple sessions, demonstrated gradual improvements in consonant identification (Kim, 2015). ...
... It currently remains unknown if listeners of an unpredictable talker would benefit from perceptual training extended over multiple training sessions. Previous studies have, indeed, demonstrated gradual improvement in perception of synthetic (Fenn et al., 2003), spectrally rotated (Green et al., 2013), time-compressed (Banai & Lavner, 2016), and foreign-accented (Earle & Myers, 2015;Xie et al., 2018) speech, over training sessions that spanned multiple days. Listeners from these experiments, then, benefitted not only Figure 2. The average intelligibility scores, indexed by percent words correct, for both pretest and posttest for each intelligibility (percent words correct) for the somatosensory and somatosensory + lexical conditions in Experiment 2, with the error bars representing ± 1 SE. ...
Article
Full-text available
Purpose Robust improvements in intelligibility following familiarization, a listener-targeted perceptual training paradigm, have been revealed for talkers diagnosed with spastic, ataxic, and hypokinetic dysarthria but not for talkers with hyperkinetic dysarthria. While the theoretical explanation for the lack of intelligibility improvement following training with hyperkinetic talkers is that there is insufficient distributional regularity in the speech signals to support perceptual adaptation, it could simply be that the standard training protocol was inadequate to facilitate learning of the unpredictable talker. In a pair of experiments, we addressed this possible alternate explanation by modifying the levels of exposure and feedback provided by the perceptual training protocol to offer listeners a more robust training experience. Method In Experiment 1, we examined the exposure modifications, testing whether perceptual adaptation to an unpredictable talker with hyperkinetic dysarthria could be achieved with greater or more diverse exposure to dysarthric speech during the training phase. In Experiment 2, we examined feedback modifications, testing whether perceptual adaptation to the unpredictable talker could be achieved with the addition of internally generated somatosensory feedback, via vocal imitation, during the training phase. Results Neither task modification led to improved intelligibility of the unpredictable talker with hyperkinetic dysarthria. Furthermore, listeners who completed the vocal imitation task demonstrated significantly reduced intelligibility at posttest. Conclusion Together, the results from Experiments 1 and 2 replicate and extend findings from our previous work, suggesting perceptual adaptation is inhibited for talkers whose speech is largely characterized by unpredictable degradations. Collectively, these results underscore the importance of integrating signal predictability into theoretical models of perceptual learning.
... While significant benefits were observed for the present computer-based phonemic contrast training, other approaches have also been shown to be beneficial. Previous studies have shown significant training benefits using a connected discourse training protocol [30,31]; similar benefits were observed between labor-intensive in-person and computer-based connected discourse training [32]. Oba et al. [11] found significant benefits for digits-in-noise training. ...
Article
Full-text available
For French cochlear implant (CI) recipients, in-person clinical auditory rehabilitation is typically provided during the first few years post-implantation. However, this is often inconvenient, it requires substantial time resources and can be problematic when appointments are unavailable. In response, we developed a computer-based home training software ("French AngelSound™") for French CI recipients. We recently conducted a pilot study to evaluate the newly developed French AngelSound™ in 15 CI recipients (5 unilateral, 5 bilateral, 5 bimodal). Outcome measures included phoneme recognition in quiet and sentence recognition in noise. Unilateral CI users were tested with the CI alone. Bilateral CI users were tested with each CI ear alone to determine the poorer ear to be trained, as well as with both ears (binaural performance). Bimodal CI users were tested with the CI ear alone, and with the contralateral hearing aid (binaural performance). Participants trained at home over a one-month period (10 hours total). Phonemic contrast training was used; the level of difficulty ranged from phoneme discrimination in quiet to phoneme identification in multi-talker babble. Unilateral and bimodal CI users trained with the CI alone; bilateral CI users trained with the poorer ear alone. Outcomes were measured before training (pre-training), immediately after training was completed (post-training), and one month after training was stopped (follow-up). For all participants, post-training CI-only vowel and consonant recognition scores significantly improved after phoneme training with the CI ear alone. For bilateral and bimodal CI users, binaural vowel and consonant recognition scores also significantly improved after training with a single CI ear. Follow-up measures showed that training benefits were largely retained. These preliminary data suggest that the phonemic contrast training in French AngelSound™ may significantly benefit French CI recipients and may complement clinical auditory rehabilitation, especially when in-person visits are not possible.
... There is some indication that even rapid learning of speech is maintained over time [32,33]. In contrast, results for the specificity of either rapid or training-induced perceptual learning of connected speech are mixed [5,6,16,18,28,34–40]. For time-compressed speech, some studies found that learning was not specific to the compression rate and even transferred from time-compressed to natural-fast speech [5,37], but others did not [28,29,41]. ...
Article
Full-text available
Perceptual learning for speech, defined as long-lasting changes in speech recognition following exposure or practice, occurs under many challenging listening conditions. However, this learning is also highly specific to the conditions in which it occurred, such that its function in adult speech recognition is not clear. We used a time-compressed speech task to assess learning following either brief exposure (rapid learning) or additional training (training-induced learning). Both types of learning were robust and long-lasting. Individual differences in rapid learning explained unique variance in recognizing natural-fast speech and speech-in-noise with no additional contribution for training-induced learning (Experiment 1). Rapid learning was stimulus specific (Experiment 2), as in previous studies on training-induced learning. We suggest that rapid learning is key for understanding the role of perceptual learning in online speech recognition, whereas longer training could provide additional opportunities to consolidate and stabilize learning.
... Azadpour and Balaban (2015) favoured a cue-weighting explanation in their study of spectrally rotated speech. In contrast, Green et al. (2013), also using spectrally rotated speech, argued that improved intelligibility involved adaptation to altered acoustical properties (in their case, spectral shape and dynamics) and found no evidence for weighting information that was relatively unaffected by the distortion (viz., intonational contrasts). ...
Article
Full-text available
When confronted with unfamiliar or novel forms of speech, listeners' word recognition performance is known to improve with exposure, but data are lacking on the fine-grained time course of adaptation. The current study aims to fill this gap by investigating the time course of adaptation to several different types of distorted speech. Keyword scores as a function of sentence position in a block of 30 sentences were measured in response to eight forms of distorted speech. Listeners recognised twice as many words in the final sentence compared to the initial sentence with around half of the gain appearing in the first three sentences, followed by gradual gains over the rest of the block. Rapid adaptation was apparent for most of the eight distortion types tested with differences mainly in the gradual phase. Adaptation to sine-wave speech improved if listeners had heard other types of distortion prior to exposure, but no similar facilitation occurred for the other types of distortion. Rapid adaptation is unlikely to be due to procedural learning since listeners had been familiarised with the task and sentence format through exposure to undistorted speech. The mechanisms that underlie rapid adaptation are currently unclear.
... Connected speech recognition under adverse conditions (e.g., distortion, background noise) [1] improves substantially following brief experiences and prolonged practice [2–9]. These improvements reflect perceptual learning, defined as relatively long-lasting changes in the ability to extract information from the environment following experience or practice [10,11]. ...
Preprint
Full-text available
Perceptual learning, defined as long-lasting changes in the ability to extract information from the environment, occurs following either brief exposure or prolonged practice. Whether these two types of experience yield qualitatively distinct patterns of learning is not clear. We used a time-compressed speech task to assess perceptual learning following either rapid exposure or additional training. We report that both experiences yielded robust and long-lasting learning. Individual differences in rapid learning explained unique variance in performance in independent speech tasks (natural-fast speech and speech-in-noise) with no additional contribution for training-induced learning (Experiment 1). Finally, it seems that similar factors influence the specificity of the two types of learning (Experiment 1 and 2). We suggest that rapid learning is key for understanding the role of perceptual learning in speech recognition under adverse conditions while longer learning could serve to strengthen and stabilize learning.
... The acoustic signal was first equalized with a filter (essentially high-pass) that gave the rotated signal approximately the same long-term spectrum as the original. This equalizing filter (33-point finite impulse response [FIR]) was constructed based on measurements of the long-term average spectrum of speech (Byrne et al. 1994), although the roll-off below 120 Hz was ignored, and a flat spectrum below 420 Hz was assumed (Scott, Rosen, et al. 2009; Green et al. 2013). The equalized signal was then amplitude modulated by a sinusoid at 4 kHz, followed by low-pass filtering at 3.8 kHz. ...
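The excerpt above lays out the rotation chain step by step: spectral equalization, amplitude modulation by a 4-kHz sinusoid, and low-pass filtering at 3.8 kHz. The following is a minimal sketch of that chain in Python, assuming a 16-kHz mono recording in a hypothetical file speech.wav; the exact 33-point LTASS-based equalization filter is not reproduced and is replaced here by a crude high-pass stand-in.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import firwin, lfilter

fs, x = wavfile.read("speech.wav")          # hypothetical 16-kHz mono input
x = x.astype(np.float64)
x /= np.max(np.abs(x)) + 1e-12              # normalise to avoid clipping

# 1. Band-limit the speech below the 4-kHz modulation frequency.
lp_pre = firwin(255, 3800, fs=fs)
x_lp = lfilter(lp_pre, 1.0, x)

# 2. Placeholder spectral equalization (a crude stand-in for the 33-point
#    LTASS-based FIR: gentle high-pass emphasis above ~420 Hz).
eq = firwin(33, 420, fs=fs, pass_zero=False)
x_eq = lfilter(eq, 1.0, x_lp)

# 3. Amplitude-modulate by a 4-kHz sinusoid: energy at frequency f moves to
#    4000 - f and 4000 + f Hz.
t = np.arange(len(x_eq)) / fs
x_mod = x_eq * np.sin(2 * np.pi * 4000 * t)

# 4. Low-pass filter at 3.8 kHz to keep only the 4000 - f image, i.e. the
#    spectrum inverted (rotated) around 2 kHz.
lp_post = firwin(255, 3800, fs=fs)
x_rot = lfilter(lp_post, 1.0, x_mod)

wavfile.write("speech_rotated.wav", fs, (x_rot * 32767).astype(np.int16))
```

Because the modulation mirrors each component at frequency f to 4000 - f Hz, the retained image is the spectrum inverted around 2 kHz, which is the rotation described in the excerpt.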
Article
Full-text available
Humans can generate mental auditory images of voices or songs, sometimes perceiving them almost as vividly as perceptual experiences. The functional networks supporting auditory imagery have been described, but less is known about the systems associated with interindividual differences in auditory imagery. Combining voxel-based morphometry and fMRI, we examined the structural basis of interindividual differences in how auditory images are subjectively perceived, and explored associations between auditory imagery, sensory-based processing, and visual imagery. Vividness of auditory imagery correlated with gray matter volume in the supplementary motor area (SMA), parietal cortex, medial superior frontal gyrus, and middle frontal gyrus. An analysis of functional responses to different types of human vocalizations revealed that the SMA and parietal sites that predict imagery are also modulated by sound type. Using representational similarity analysis, we found that higher representational specificity of heard sounds in SMA predicts vividness of imagery, indicating a mechanistic link between sensory- and imagery-based processing in sensorimotor cortex. Vividness of imagery in the visual domain also correlated with SMA structure, and with auditory imagery scores. Altogether, these findings provide evidence for a signature of imagery in brain structure, and highlight a common role of perceptual-motor interactions for processing heard and internally generated auditory information.
... Although brief exposure is sufficient to initiate the perceptual learning of distorted speech, additional multi-session practice nevertheless yields additional learning (Banai and Lavner, 2014; Green et al., 2013; Song et al., 2012; Stacey and Summerfield, 2008). In the case of time-compressed speech, previous studies suggest that learning continues through several training sessions, each consisting of several hundred sentences, in both L1 and L2 listeners (Banai and Lavner, 2012, 2014). ...
Article
The present study investigated the effects of language experience on the perceptual learning induced by either brief exposure to or more intensive training with time-compressed speech. Native (n = 30) and nonnative (n = 30) listeners were each divided into three groups with different experiences with time-compressed speech: a trained group who trained on the semantic verification of time-compressed sentences for three sessions, an exposure group briefly exposed to 20 time-compressed sentences, and a group of naive listeners. Recognition was assessed with three sets of time-compressed sentences intended to evaluate exposure-induced and training-induced learning as well as across-token and across-talker generalization. Learning profiles differed between native and nonnative listeners. Exposure had a weaker effect in nonnative than in native listeners. Furthermore, native and nonnative trained listeners significantly outperformed their untrained counterparts when tested with sentences taken from the training set. However, only trained native listeners outperformed naive native listeners when tested with new sentences. These findings suggest that the perceptual learning of speech is sensitive to linguistic experience. That rapid learning is weaker in nonnative listeners is consistent with their difficulties in real-life conditions. Furthermore, nonnative listeners may require longer periods of practice to achieve native-like learning outcomes.
... It has a largely unchanged pitch profile, where some vowels remain relatively unchanged and some voice and manner cues are preserved. However, it is still unintelligible without significant training (Green, Rosen, Faulkner, & Paterson, 2013; Azadpour & Balaban, 2008; Blesser, 1972). ...
Article
Full-text available
Spoken conversations typically take place in noisy environments, and different kinds of masking sounds place differing demands on cognitive resources. Previous studies, examining the modulation of neural activity associated with the properties of competing sounds, have shown that additional speech streams engage the superior temporal gyrus. However, the absence of a condition in which target speech was heard without additional masking made it difficult to identify brain networks specific to masking and to ascertain the extent to which competing speech was processed equivalently to target speech. In this study, we scanned young healthy adults with continuous fMRI, while they listened to stories masked by sounds that differed in their similarity to speech. We show that auditory attention and control networks are activated during attentive listening to masked speech in the absence of an overt behavioral task. We demonstrate that competing speech is processed predominantly in the left hemisphere within the same pathway as target speech but is not treated equivalently within that stream and that individuals who perform better in speech in noise tasks activate the left mid-posterior superior temporal gyrus more. Finally, we identify neural responses associated with the onset of sounds in the auditory environment; activity was found within right lateralized frontal regions consistent with a phasic alerting response. Taken together, these results provide a comprehensive account of the neural processes involved in listening in noise.
Article
Voices are the most relevant social sounds for humans and therefore have crucial adaptive value in development. Neuroimaging studies in adults have demonstrated the existence of regions in the superior temporal sulcus that respond preferentially to voices. Yet, whether voices represent a functionally specific category in the young infant’s mind is largely unknown. We developed a highly sensitive paradigm relying on fast periodic auditory stimulation (FPAS) combined with scalp electroencephalography (EEG) to demonstrate that the infant brain implements a reliable preferential response to voices early in life. Twenty-three 4-month-old infants listened to sequences containing non-vocal sounds from different categories presented at 3.33 Hz, with highly heterogeneous vocal sounds appearing every third stimulus (1.11 Hz). We were able to isolate a voice-selective response over temporal regions, and individual voice-selective responses were found in most infants within only a few minutes of stimulation. This selective response was significantly reduced for the same frequency-scrambled sounds, indicating that voice selectivity is not simply driven by the envelope and the spectral content of the sounds. Such a robust selective response to voices as early as 4 months of age suggests that the infant brain is endowed with the ability to rapidly develop a functional selectivity to this socially relevant category of sounds.
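Since the paradigm above hinges on simple arithmetic of the stimulation rates (a voice on every third stimulus of a 3.33-Hz stream recurs at 3.33/3 ≈ 1.11 Hz), a toy simulation can make the frequency-tagging logic concrete. Everything below is illustrative: the sampling rate, sequence duration, and the simplistic impulse-like "response" are assumptions, not details of the cited study.

```python
import numpy as np

fs = 250.0                        # EEG sampling rate in Hz (assumed)
base_rate = 3.33                  # stimulus onsets per second
duration = 60.0                   # length of one stimulation sequence (s)

onsets = np.arange(0.0, duration, 1.0 / base_rate)
is_voice = np.arange(len(onsets)) % 3 == 2     # every third stimulus is vocal

# Toy response: every onset evokes a unit response; voices evoke extra gain.
n = int(duration * fs)
eeg = np.zeros(n)
for onset, voice in zip(onsets, is_voice):
    idx = int(round(onset * fs))
    if idx < n:
        eeg[idx] += 2.0 if voice else 1.0
eeg += 0.5 * np.random.default_rng(0).standard_normal(n)

# Amplitude spectrum: a general auditory response appears at 3.33 Hz and its
# harmonics; a voice-selective response appears at 1.11 Hz and its harmonics.
spec = np.abs(np.fft.rfft(eeg)) / n
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
for target in (1.11, 3.33):
    k = int(np.argmin(np.abs(freqs - target)))
    print(f"{freqs[k]:.2f} Hz amplitude: {spec[k]:.4f}")
```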
Article
Individuals who are speaking in a second language tend to use the language in ways that differ from native speakers. As listeners build representations of nonnative‐accented speech, the need for explicit processing should decrease and fewer attentional resources should be necessary for listeners to access the lexical items intended by the nonnative speakers. A growing body of work suggests that listeners can adapt to nonnative speech after both long‐ and short‐term exposure to these speech varieties. The influence of both accent strength and listener experience on accuracy and processing speed were gradient and nonlinear. An important issue that has drawn more attention is how accent adaptation may change across the life span. Many divergences from native norms in nonnative speech involve shifts in category boundaries rather than category mismatches. Although nonnative speech introduces variability into the speech signal, it is well established that native speech also contains substantial variability.
Article
Full-text available
The long-term average speech spectrum (LTASS) and some dynamic characteristics of speech were determined for 12 languages: English (several dialects), Swedish, Danish, German, French (Canadian), Japanese, Cantonese, Mandarin, Russian, Welsh, Singhalese, and Vietnamese. The LTASS only was also measured for Arabic. Speech samples (18) were recorded, using standardized equipment and procedures, in 15 localities for (usually) ten male and ten female talkers. All analyses were conducted at the National Acoustic Laboratories, Sydney. The LTASS was similar for all languages although there were many statistically significant differences. Such differences were small and not always consistent for male and female samples of the same language. For one-third octave bands of speech, the maximum short-term rms level was 10 dB above the maximum long-term rms level, consistent across languages and frequency. A "universal" LTASS is suggested as being applicable, across languages, for many purposes including use in hearing aid prescription procedures and in the Articulation Index.
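As a rough illustration of how a long-term average spectrum in one-third octave bands can be estimated, the sketch below integrates a Welch power spectral density over standard third-octave bands. The file name, band range, and FFT settings are assumptions; this is not the measurement procedure used in the cited study.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

fs, x = wavfile.read("speech.wav")          # hypothetical mono input file
x = x.astype(np.float64)

# Power spectral density averaged over the whole recording ("long term").
f, psd = welch(x, fs=fs, nperseg=4096)

# Standard one-third-octave centre frequencies from ~100 Hz to 8 kHz.
centres = 1000.0 * 2.0 ** (np.arange(-10, 10) / 3.0)
edges_lo = centres * 2.0 ** (-1.0 / 6.0)
edges_hi = centres * 2.0 ** (1.0 / 6.0)

df = f[1] - f[0]
for fc, lo, hi in zip(centres, edges_lo, edges_hi):
    band = (f >= lo) & (f < hi)
    power = np.sum(psd[band]) * df           # integrate PSD over the band
    level_db = 10.0 * np.log10(power + 1e-20)
    print(f"{fc:7.1f} Hz : {level_db:6.1f} dB (re. full scale squared)")
```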
Article
Full-text available
Previous research has shown that, when hearers listen to artificially speeded speech, their performance improves over the course of 10–15 sentences, as if their perceptual system was "adapting" to these fast rates of speech. In this paper, we further investigate the mechanisms that are responsible for such effects. In Experiment 1, we report that, for bilingual speakers of Catalan and Spanish, exposure to compressed sentences in either language improves performance on sentences in the other language. Experiment 2 reports that Catalan/Spanish transfer of performance occurs even in monolingual speakers of Spanish who do not understand Catalan. In Experiment 3, we study another pair of languages, namely English and French, and report no transfer of adaptation between these two languages for English-French bilinguals. Experiment 4, with monolingual English speakers, assesses transfer of adaptation from French, Dutch, and English toward English. Here we find that there is no adaptation from French and intermediate adaptation from Dutch. We discuss the locus of the adaptation to compressed speech and relate our findings to other cross-linguistic studies in speech perception.
Article
Full-text available
Multi-channel cochlear implants typically present spectral information to the wrong "place" in the auditory nerve array, because electrodes can only be inserted partway into the cochlea. Although such spectral shifts are known to cause large immediate decrements in performance in simulations, the extent to which listeners can adapt to such shifts has yet to be investigated. Here, the effects of a four-channel implant in normal listeners have been simulated, and performance tested with unshifted spectral information and with the equivalent of a 6.5-mm basalward shift on the basilar membrane (1.3-2.9 octaves, depending on frequency). As expected, the unshifted simulation led to relatively high levels of mean performance (e.g., 64% of words in sentences correctly identified) whereas the shifted simulation led to very poor results (e.g., 1% of words). However, after just nine 20-min sessions of connected discourse tracking with the shifted simulation, performance improved significantly for the identification of intervocalic consonants, medial vowels in monosyllables, and words in sentences (30% of words). Also, listeners were able to track connected discourse of shifted signals without lipreading at rates up to 40 words per minute. Although we do not know if complete adaptation to the shifted signals is possible, it is clear that short-term experiments seriously exaggerate the long-term consequences of such spectral shifts.
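A common way to realize this kind of simulation is a noise-excited vocoder whose carrier bands are shifted basalward relative to the analysis bands. The sketch below is a generic four-channel version under assumed band edges and a one-octave shift, intended only to illustrate the idea rather than reproduce the cited study's exact parameters (which correspond to a 6.5-mm, 1.3–2.9 octave shift). A 16-kHz mono input file named speech.wav is assumed.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, sosfiltfilt, hilbert

fs, x = wavfile.read("speech.wav")                 # hypothetical input file
x = x.astype(np.float64)
x /= np.max(np.abs(x)) + 1e-12

analysis_edges = [100, 392, 1064, 2608, 6000]      # 4 analysis bands (Hz), assumed
shift_octaves = 1.0                                # illustrative basalward shift
carrier_edges = [min(f * 2 ** shift_octaves, fs / 2 * 0.95) for f in analysis_edges]

def bandpass(sig, lo, hi):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, sig)

rng = np.random.default_rng(0)
out = np.zeros_like(x)
for ch in range(4):
    band = bandpass(x, analysis_edges[ch], analysis_edges[ch + 1])
    env = np.abs(hilbert(band))                    # temporal envelope
    sos_env = butter(2, 400, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(sos_env, env)                # smooth the envelope
    noise = rng.standard_normal(len(x))
    carrier = bandpass(noise, carrier_edges[ch], carrier_edges[ch + 1])
    out += env * carrier                           # re-impose envelope on shifted band

out /= np.max(np.abs(out)) + 1e-12
wavfile.write("speech_vocoded_shifted.wav", fs, (out * 32767).astype(np.int16))
```

Setting shift_octaves to 0 gives the unshifted simulation; increasing it moves every carrier band toward the base, mimicking a shallow electrode insertion.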
Article
Full-text available
Recent work demonstrates that learning to understand noise-vocoded (NV) speech alters sublexical perceptual processes but is enhanced by the simultaneous provision of higher-level, phonological, but not lexical content (Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008), consistent with top-down learning (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005; Hervais-Adelman et al., 2008). Here, we investigate whether training listeners with specific types of NV speech improves intelligibility of vocoded speech with different acoustic characteristics. Transfer of perceptual learning would provide evidence for abstraction from variable properties of the speech input. In Experiment 1, we demonstrate that learning of NV speech in one frequency region generalizes to an untrained frequency region. In Experiment 2, we assessed generalization among three carrier signals used to create NV speech: noise bands, pulse trains, and sine waves. Stimuli created using these three carriers possess the same slow, time-varying amplitude information and are equated for naïve intelligibility but differ in their temporal fine structure. Perceptual learning generalized partially, but not completely, among different carrier signals. These results delimit the functional and neural locus of perceptual learning of vocoded speech. Generalization across frequency regions suggests that learning occurs at a stage of processing at which some abstraction from the physical signal has occurred, while incomplete transfer across carriers indicates that learning occurs at a stage of processing that is sensitive to acoustic features critical for speech perception (e.g., noise, periodicity).
Article
Full-text available
People were trained to decode noise-vocoded speech by hearing monosyllabic stimuli in distorted and unaltered forms. When later presented with different stimuli, listeners were able to successfully generalize their experience. However, generalization was modulated by the degree to which testing stimuli resembled training stimuli: Testing stimuli's consonants were easier to recognize when they had occurred in the same position at training, or flanked by the same vowel, than when they did not. Furthermore, greater generalization occurred when listeners had been trained on existing words than on nonsense strings. We propose that the process by which adult listeners learn to interpret distorted speech is akin to building phonological categories in one's native language, a process where categories and structure emerge from the words in the ambient language without completely abstracting from them.
Article
The strategy for measuring speech-reception thresholds for sentences in noise advocated by Plomp and Mimpen (Audiology, 18, 43–52, 1979) was modified to create a reliable test for measuring the difficulty which listeners have in speech reception, both auditorily and audio-visually. The test materials consist of 10 lists of 15 short sentences of homogeneous intelligibility when presented acoustically, and of different, but still homogeneous, intelligibility when presented audio-visually, in white noise. Homogeneity was achieved by applying phonetic and linguistic principles at the stage of compilation, followed by pilot testing and balancing of properties. To run the test, lists are presented at signal-to-noise ratios (SNRs) determined by an up-down psychophysical rule so as to estimate auditory and audiovisual speech-reception thresholds, defined as the SNRs at which the three content words in each sentence are identified correctly on 50% of trials. These thresholds provide measures of a subject's speech-reception abilities. The difference between them provides a measure of the benefit received from vision. It is shown that this measure is closely related to the accuracy with which subjects lip-read words in sentences with no acoustical information. In data from normally hearing adults, the standard deviations (s.d.s) of estimates of auditory speech reception threshold in noise (SRTN), audio-visual SRTN, and visual benefit are 1.2, 2.0, and 2.3 dB, respectively. Graphs are provided with which to estimate the trade-off between reliability and the number of lists presented, and to assess the significance of deviant scores from individual subjects.
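The adaptive logic described above (adjust the SNR after each sentence with an up-down rule so that the track converges on the 50%-correct point) can be sketched as follows. The simulated psychometric function, starting SNR, and step size are illustrative assumptions, not the published procedure's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_listener(snr_db, srt_true=-6.0, slope_db=1.5):
    """Probability that all three content words in a sentence are reported
    correctly, modelled with a logistic psychometric function (illustrative)."""
    p = 1.0 / (1.0 + np.exp(-(snr_db - srt_true) / slope_db))
    return rng.random() < p

snr = 0.0            # starting SNR in dB (assumed)
step = 2.0           # step size in dB (assumed)
track = []
for sentence in range(15):              # one 15-sentence list
    correct = simulated_listener(snr)
    track.append(snr)
    snr += -step if correct else step   # harder after a correct response

# A simple threshold estimate: the mean SNR over the later part of the track.
srt_estimate = np.mean(track[4:])
print(f"Estimated speech-reception threshold: {srt_estimate:.1f} dB SNR")
```

Because a one-up/one-down rule converges on the point where correct and incorrect responses are equally likely, the track settles around the SNR giving 50% correct, which is how the threshold is defined in the abstract above.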
Article
Neurofibromatosis type 2 (NF2) produces benign Schwann cell tumors on many cranial nerves, in particular on the vestibular portions of the VIIIn bilaterally. Removal of these vestibular schwannomas usually severs the auditory portion of the VIIIn, thus deafening the patients. The auditory brainstem implant (ABI) was designed to provide prosthetic electric stimulation of the cochlear nucleus in the brainstem to restore some hearing sensations to patients deafened by bilateral removal of vestibular schwannomas. This study will review the new developments and improving outcomes of the ABI. From its initial application in 1979 until about 2005, the ABI provided modest but useful auditory sensations to NF2 patients. However, application of the ABI in non-NF2 populations and in children with congenital malformations demonstrated better results, showing that the ABI could provide high levels of speech recognition. Recent results show excellent speech recognition in NF2 patients as well. This study will discuss the potential causes of the variability in ABI outcomes. ABIs activate neurons in the cochlear nucleus to recreate hearing sensations in people who have become deaf as a result of the loss of the auditory nerve. Most NF2 patients show functional hearing benefit from the ABI, with awareness and recognition of environmental sounds and enhancement of lipreading. It is now clear that ABIs can produce excellent speech recognition in some patients with NF2, allowing even conversational telephone use. Although the factors leading to this improved performance are not completely clear, these new results show that excellent hearing is possible for NF2 patients with the ABI.
Article
We review in a common framework several algorithms that have been proposed recently to improve the voice quality of text-to-speech synthesis based on the concatenation of acoustical units (Charpentier and Moulines, 1988; Moulines and Charpentier, 1988; Hamon et al., 1989). These algorithms rely on a pitch-synchronous overlap-add (PSOLA) approach for modifying the speech prosody and concatenating speech waveforms. The modifications of the speech signal are performed either in the frequency domain (FD-PSOLA), using the Fast Fourier Transform, or directly in the time domain (TD-PSOLA), depending on the length of the window used in the synthesis process. The frequency-domain approach offers great flexibility in modifying the spectral characteristics of the speech signal, while the time-domain approach provides very efficient solutions for the real-time implementation of synthesis systems. We also discuss the different kinds of distortions involved in these different algorithms.
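To make the TD-PSOLA idea concrete, here is a minimal time-scale modification sketch: two-period Hann-windowed grains centred on known pitch marks are overlap-added at re-spaced output positions, slowing or speeding the utterance while keeping the local pitch period. Pitch-mark estimation, unvoiced-segment handling, and pitch modification are all omitted; the function and its parameters are illustrative, not the algorithm as published.

```python
import numpy as np

def td_psola_stretch(x, pitch_marks, rate):
    """Stretch or compress x by `rate` (>1 slows down) by overlap-adding
    two-period, Hann-windowed grains centred on strictly increasing,
    integer-valued pitch marks."""
    x = np.asarray(x, dtype=float)
    pitch_marks = np.asarray(pitch_marks, dtype=int)
    periods = np.diff(pitch_marks)
    out_len = int(len(x) * rate)
    out = np.zeros(out_len)

    t_out = pitch_marks[0] * rate            # first output pitch-mark position
    while t_out < out_len - 1:
        # Nearest input pitch mark to the current output position.
        i = int(np.argmin(np.abs(pitch_marks - t_out / rate)))
        period = int(periods[min(i, len(periods) - 1)])
        start, stop = pitch_marks[i] - period, pitch_marks[i] + period
        if start < 0 or stop > len(x):
            break
        grain = x[start:stop] * np.hanning(stop - start)

        pos = int(round(t_out)) - period
        if pos >= 0:
            seg = out[pos:pos + len(grain)]
            seg += grain[:len(seg)]          # overlap-add at the new position
        t_out += period                      # keep the local pitch period
    return out

# Usage on a synthetic 100-Hz pulse train at 16 kHz (pitch marks are known):
fs = 16000
period = fs // 100
marks = np.arange(period, fs - period, period)
x = np.zeros(fs)
x[marks] = 1.0
y = td_psola_stretch(x, marks, rate=1.5)     # 50% longer, same pitch period
```

Keeping the grain spacing equal to the local pitch period is what preserves pitch while the number of grains, and hence the duration, changes; repositioning grains at scaled intervals instead would modify pitch, which is the other use of PSOLA mentioned in the abstract.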