Adaptation to spectrally-rotated speech
Tim Green, Stuart Rosen, Andrew Faulkner, and Ruth Paterson
Speech, Hearing, and Phonetic Sciences, UCL, Chandler House, 2, Wakefield Street, London, WC1N 1PF,
United Kingdom
(Received 20 December 2012; revised 28 May 2013; accepted 10 June 2013)
Much recent interest surrounds listeners’ abilities to adapt to various transformations that distort
speech. An extreme example is spectral rotation, in which the spectrum of low-pass filtered speech
is inverted around a center frequency (2 kHz here). Spectral shape and its dynamics are completely
altered, rendering speech virtually unintelligible initially. However, intonation, rhythm, and con-
trasts in periodicity and aperiodicity are largely unaffected. Four normal hearing adults underwent
6 h of training with spectrally-rotated speech using Connected Discourse Tracking. They and an
untrained control group completed pre- and post-training speech perception tests, for which talkers
differed from the training talker. Significantly improved recognition of spectrally-rotated sentences
was observed for trained, but not untrained, participants. However, there were no significant
improvements in the identification of medial vowels in /bVd/ syllables or intervocalic consonants.
Additional tests were performed with speech materials manipulated so as to isolate the contribution
of various speech features. These showed that preserving intonational contrasts did not contribute
to the comprehension of spectrally-rotated speech after training, and suggested that improvements
involved adaptation to altered spectral shape and dynamics, rather than just learning to focus on
speech features relatively unaffected by the transformation.
© 2013 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4812759]
PACS number(s): 43.71.Sy [JMH] Pages: 1369–1377
I. INTRODUCTION
Listeners possess considerable abilities to adapt to trans-
formations which, to various extents, degrade and distort im-
portant features of speech signals. Examples include noise-
or tone-excited vocoding (Hill et al., 1968; Shannon et al.,
1995; Dorman et al., 1997a), sine-wave speech (Remez
et al., 1981), time-compressed speech (Dupoux and Green,
1997), and spectral rotation (Blesser, 1972). The speed
and degree of adaptation vary across transformations.
Investigation of factors that contribute to adaptation and its
limitations may provide valuable insights into perceptual
learning processes and inform models of speech perception.
Adaptation to noise- or tone-vocoded speech has
received considerable interest, not least because this type of
processing has features in common with the processing typi-
cally applied in cochlear implant systems (Davis et al.,
2005; Hervais-Adelman et al., 2008; Hervais-Adelman
et al., 2011; Loebach et al., 2010). Spectral resolution is lim-
ited to a small number of broad frequency bands, temporal
fine structure is eliminated, but amplitude envelopes within
each frequency band are preserved. Learning of speech that
has been tone- or noise-vocoded but not subject to other dis-
tortion is very rapid. For example, using six-channel noise-
vocoded speech, Davis et al. (2005) reported that sentence
recognition improved from near zero to 70% words correct
over the course of presentation of just 30 sentences. One im-
portant finding has been that improvement after training is
seen for words that were not heard during training, suggest-
ing that learning involves modification of the processing of
phonetic cues at a sublexical level (Davis et al., 2005).
Similarly, with time-compressed speech, the degree to which
learning transfers across languages has been found to depend
on the phonological similarity between the languages
(Pallier et al., 1998), suggesting that learning occurred at the
phonetic level.
The rapidity of adaptation to vocoded speech probably
reflects the fact that, while fine spectral detail is lost, the over-
all shape and position of the spectral envelope are well pre-
served. For cochlear implant users, the representation
of speech spectral information is subject to more complex
transformations than those involved in straightforward vocod-
ing. For example, post-lingually deafened cochlear implant
users must adapt to some change of frequency to place map-
ping, since it is highly unlikely that all of the electrode con-
tacts will be at tonotopically correct places. Typically, an
overall upward spectral shift will arise due to incomplete
insertion of the electrode array (Ketten et al., 1998). Short-term studies using noise-excited vocoding in normal hearing
listeners have shown that such shifts in spectral envelope
have a highly detrimental effect on speech perception, far
beyond that imposed by vocoding per se, and largely inde-
pendent of the degree of spectral resolution (Dorman et al.,
1997b; Shannon et al., 1998). However, a few hours of training has been shown to be sufficient to lead to significant
improvements in sentence recognition for speech that has
been both noise-vocoded and spectrally-shifted (Faulkner
et al., 2006; Fu and Galvin, 2003; Rosen et al., 1999).
Similarly, Smith and Faulkner (2006) showed that adaptation
was possible to noise-vocoded speech in which the fre-
quency-to-place map was “warped.” This simulated a situa-
tion in which there is a “dead” cochlear region with no
functional neurons, and the frequency map is adjusted so as
to distribute spectral information from the whole signal over
the functioning cochlear regions on either side of the dead
region.
J. Acoust. Soc. Am. 134 (2), August 2013 © 2013 Acoustical Society of America 1369
The addition of spectral shifting or warping results in
more complex transformations of speech spectral informa-
tion than noise- or tone-vocoding alone. However, these are
still monotonic transformations that largely preserve the rel-
ative shape of the spectral envelope. Substantially greater
difficulties in adaptation might be anticipated for distortions
of speech signals that involve non-monotonic transforma-
tions of the spectral envelope. Some form of non-
monotonicity might occur in cochlear implant users due to a
range of factors that affect the extent to which current deliv-
ered to a particular electrode is effective in stimulating an
appropriate population of auditory nerve fibers, such as the
possibility of cross-turn stimulation (e.g., Finley et al.,
2008). Such distortions are even more likely in users of audi-
tory brainstem implants (ABIs), where assignment of audi-
tory filters to electrodes relies on clinical pitch ranking
procedures. The difficulties inherent in such procedures
make it more difficult to establish a tonotopic pattern of elec-
trical stimulation (Colletti et al., 2012).
An extreme example of a non-monotonic spectral trans-
formation is spectral rotation, in which the bandwidth of the
signal is first restricted by low-pass filtering and the spec-
trum is then inverted around the center frequency (Blesser,
1969, 1972; Azadpour and Balaban, 2008). Some speech
features are more or less unaffected by spectral rotation,
including amplitude envelope, the presence or absence of pe-
riodicity, and pitch variation that conveys intonation.
However, the inversion of spectral shape and dynamics
makes rotated speech completely unintelligible for naive lis-
teners. That spectrally-rotated speech retains many of the
acoustical properties of actual speech while being unintelli-
gible has led to its widespread use as a non-speech control in
neuroimaging studies of speech perception (e.g., Scott et al.,
2000).
However, it appears that considerable adaptation to this
extreme transformation is possible over a fairly short time.
Blesser (1969, 1972) provided experience with spectral rota-
tion in several 30-min sessions. Pairs of participants who
were well-known to each other heard each other’s speech
only in spectrally-rotated form and communicated purely by
auditory means, using any approach that they found practi-
cal. A range of speech tests, including vowel and consonant
perception and recognition of single words and sentences,
were carried out at various times during the course of the
experiment.
Learning was observed both for the identification of
vowels and consonants and for comprehension of whole sen-
tences, with sentence scores reaching 35% syllables correct
on average after up to 10 h experience. A few points should
be noted here, however. First, in contrast to typical contem-
porary practice, scoring was based not just on key words but
on all syllables within a sentence. Second, participants were
tested with the same sentence list on repeated occasions.
Finally, there was large variability across participants, which
may in part reflect differences in the approaches to learning
adopted by different pairs of participants. Blesser speculated
that there may be a particularly important role for intonation
in the comprehension of spectrally-rotated speech, although
this was not tested directly. Spectral rotation of voiced
speech destroys the original harmonic spectral structure
since the fundamental frequency and all its harmonics are
transposed to different frequencies. However, while the new
spectral components are no longer integer multiples of a
common fundamental frequency (F0), the spacing between
them remains equal to the original F0. This gives rise to rela-
tively strong pitch percepts which rise and fall in the same
pattern as the pitch of the original speech (Plomp, 1967).
Thus, the shape of intonation contours, and the prosodic in-
formation that they convey, are well preserved after spectral
rotation and may contribute to adaptation. This contrasts
with noise-excited vocoded speech in which voice pitch in-
formation is severely degraded or non-existent, depending
upon the details of the processing (Green et al., 2002; Souza
and Rosen, 2009).
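The arithmetic behind this preserved spacing is easily illustrated. The sketch below (plain Python; the 150-Hz fundamental is an assumed illustrative value, not taken from the paper) maps each harmonic through the rotation f → 4000 − f and checks the spacing of the resulting components:

```python
F0 = 150.0        # assumed fundamental frequency, Hz (illustrative)
LP_EDGE = 4000.0  # low-pass edge; rotation about 2 kHz maps f -> 4000 - f

# harmonics of the original voice: integer multiples of F0 below the edge
harmonics = [F0 * k for k in range(1, 27) if F0 * k < LP_EDGE]

# the same components after spectral rotation
rotated = sorted(LP_EDGE - f for f in harmonics)

# adjacent rotated components are still spaced F0 apart...
spacings = [b - a for a, b in zip(rotated, rotated[1:])]

# ...but they are no longer integer multiples of a common fundamental:
# e.g. the highest component sits at 4000 - 150 = 3850 Hz
misaligned = [f for f in rotated if (f / F0) % 1 != 0]
```

Every spacing comes out equal to F0, while no rotated component falls on an integer multiple of F0, consistent with the residue-pitch account of Plomp (1967).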
Blesser’s work would appear to suggest that a substan-
tial degree of adaptation to an extreme distortion of spectral
shape and dynamics is possible. This would represent a strik-
ing example of plasticity in the perception of a fundamental
acoustic property essential for speech understanding. Such
plasticity may be of relevance to some users of auditory
prostheses, who may experience severe spectral transforma-
tions, albeit not as extreme as spectral rotation. It may also
have implications in relation to the use of spectral rotation as
a non-speech control in imaging studies. However, the
uncontrolled nature of Blesser’s procedures makes it difficult
to be sure of the underlying processes. For example, it is not
clear to what extent improvements in sentence recognition
over the course of Blesser’s experiment were attributable
specifically to adaptation to altered spectral dynamics, rather
than learning to make use of relatively well preserved fea-
tures of transformed speech, or to increasing familiarity with
test materials and procedures. The latter point is important
since, using spectrally-shifted, noise-vocoded speech,
Stacey
and Summerfield (2008) showed that substantial improve-
ments in sentence recognition and phoneme identification
occurred due to repeated testing without any intervening
training.
Here, we investigate adaptation to spectrally-rotated
speech in more controlled conditions than those used by
Blesser (1969, 1972). We employed Connected Discourse
Tracking (CDT) (De Filippo and Scott, 1978), a training
method previously found effective for spectrally-shifted
speech (Rosen et al., 1999). Regular tests of phoneme and
sentence recognition probed the course of learning and test
conditions were included in which stimuli were manipulated
so as to assess the contribution of features largely unaffected
by spectral rotation, such as the shape of intonation contours
and the presence or absence of periodicity. Based on
Blesser’s findings it is hypothesized that participants who
receive training will show significantly larger improvements
in perception of spectrally-rotated consonants, vowels, and
sentences than participants who receive no training. In addi-
tion, if learning does not involve adaptation to altered spec-
tral shape and dynamics, but merely reflects enhanced use of
speech features preserved by spectral rotation, benefits of
training would still be expected for stimuli in which ampli-
tude envelope, pitch, and periodicity information is pre-
served, while spectral variation is eliminated. For the
particular feature of intonation, conversely, performance for
the trained group would be significantly reduced for stimuli
processed so as to eliminate intonation cues.
II. METHODS
A. Participants
Eight normal hearing adults participated after giving
informed consent. All were aged between 20 and 30, and 5
were male. Four of the participants (T1–T4) received train-
ing with spectrally-rotated speech, while the remaining four
did not. One of the untrained participants was bilingual in
English and Gujarati, while the remainder had English as
their sole native language.
B. Speech tests
1. Consonants
Recordings of 20 consonants [m n w r l j b p d t g k tʃ ʃ dʒ s f ð z v], in VCV format, spoken by 1 male and 1 female
speaker of Southern Standard British English were available.
Three different vowel contexts were used (/i/, /u/, and /ɑ/),
resulting in a total of 120 stimuli. Participants responded
using a mouse to click on 1 of 20 orthographically-labeled
buttons displayed on a computer screen.
2. Vowels
Recordings of 17 vowels in /bVd/ context, spoken by
the same male and female speakers of Southern Standard
British English were available. There were ten monophthongs [/æ/ (bad), /ɑː/ (bard), /iː/ (bead), /ɛ/ (bed), /ɪ/ (bid), /ɜː/ (bird), /ɒ/ (bod), /ɔː/ (board), /uː/ (booed), /ʌ/ (bud)] and seven diphthongs [/eə/ (bared), /eɪ/ (bayed), /ɪə/ (beard), /aɪ/ (bide), /əʊ/ (bode), /aʊ/ (boughed), /ɔɪ/ (Boyd)].
Two tokens from each speaker for each vowel were used,
giving a total of 68 stimuli. Participants responded by click-
ing on 1 of 17 on-screen buttons orthographically labeled
with the full words.
3. Sentences
Two sets of sentence materials were used. One com-
prised video recordings of Bamford-Kowal-Bench (BKB)
sentences (Bench et al., 1979), read by a female speaker of
Southern Standard British English. In some conditions these
were presented audiovisually, while in others only the sound
was presented. The other consisted of audio-only recordings
of the Adaptive Sentence List (ASL) sentences (MacLeod
and Summerfield, 1990) read by a male speaker of Southern
Standard British English. Like BKB sentences, the ASL
materials are short, highly predictable sentences, e.g., “The
bag was very heavy.” The speakers were different from those
who produced the consonant and vowel materials. Twenty-
one BKB lists, each containing 16 sentences and 50 key-
words, and 18 ASL lists, each containing 15 sentences and
45 key words, were used. After a sentence was presented,
participants repeated whatever words they thought they had
heard. The experimenter then recorded the number of key
words correct, applying a loose scoring method in which a
response was scored as correct if its root matched that of the
key word.
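Loose keyword scoring was applied by the experimenter by ear. Purely as an illustration of the rule, a crude automatic approximation might strip a few common English suffixes before comparing; the suffix list and the `loose_match` helper below are inventions of this sketch, not part of the study's procedure:

```python
def _root(word):
    """Crude root extraction: strip a few common suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def loose_match(response_word, key_word):
    """Score a response as correct if its root matches that of the key word."""
    return _root(response_word) == _root(key_word)
```

Under such a rule, a response of "bags" would be credited against the key word "bag".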
C. Signal processing and equipment
Participants listened via Sennheiser headphones (HD 25
SP). Spectral rotation was performed in real time using
the software system Aladdin (Hitech AB, Sweden) and a
digital-signal-processing PC card (Loughborough Sound
Images TMS320C31) running at a sampling rate of
22.05 kHz. Input speech was first low-pass filtered at 4 kHz
using a tenth-order elliptic filter. Additional filtering (33-
point finite impulse response) was applied in order to mini-
mize differences in the long-term spectra of rotated and nor-
mal speech. The design of this additional filter was based
largely on published measurements of the long-term average
speech spectrum (Byrne et al., 1994), although the roll-off
below 120 Hz was ignored, and a flat spectrum below
420 Hz assumed. Spectral rotation around 2 kHz was then
implemented via modulation with a 4-kHz sinusoid. In order
to remove upper frequency side bands, the modulated signal
was low-pass filtered again with the same elliptic filter. The
total root-mean-square level of the spectrally-rotated signal
was set equal to that of the original low-pass filtered signal.
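As a rough offline illustration of this chain (not the real-time Aladdin/DSP implementation used in the study: a windowed-sinc FIR stands in for the tenth-order elliptic filter, and the long-term-spectrum matching filter is omitted), the core steps are low-pass filtering, ring modulation with a 4-kHz sinusoid, low-pass filtering again to remove the upper sideband, and RMS matching:

```python
import math

FS = 22050  # sampling rate used in the study, Hz

def lowpass(x, cutoff=4000.0, fs=FS, taps=101):
    """Windowed-sinc FIR low-pass (a stand-in for the paper's elliptic filter)."""
    fc = cutoff / fs
    m = taps - 1
    h = []
    for n in range(taps):
        k = n - m / 2.0
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)  # Hamming window
        h.append(ideal * window)
    return [sum(h[j] * x[i - j] for j in range(taps) if 0 <= i - j < len(x))
            for i in range(len(x))]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def spectrally_rotate(x, fs=FS):
    lp = lowpass(x)
    # modulation by a 4-kHz sinusoid creates sidebands at 4000 - f and 4000 + f
    mod = [2.0 * v * math.cos(2 * math.pi * 4000.0 * i / fs)
           for i, v in enumerate(lp)]
    out = lowpass(mod)  # remove the upper (4000 + f) sideband
    gain = rms(lp) / max(rms(out), 1e-12)  # match level to the low-passed input
    return [gain * v for v in out]
```

A 500-Hz input component, for instance, emerges near 3500 Hz after rotation.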
D. Additional manipulations probing the role
of speech features unaffected by spectral rotation
In order to assess the contribution of intonation to the
comprehension of spectrally-rotated speech, sentence tests
were included in which, prior to spectral rotation, stimuli
were manipulated so as to eliminate intonation information.
The Pitch Synchronous Overlap Add technique (Moulines
and Charpentier, 1990), as implemented in Praat (Boersma
and Weenink, 2001) was used to replace the natural pitch
contours of the BKB and ASL sentences with a monotone at
230 Hz for the female talker and at 150 Hz for the male
talker. As a control for possible artifacts introduced by Praat,
a further condition was included in which the natural pitch
contours were shifted, up by 3.5 semitones for the male
talker and down by 3.5 semitones for the female talker.
These conditions will subsequently be referred to respec-
tively as “Monotone” and “Shifted.”
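A 3.5-semitone shift corresponds to scaling F0 by 2^(3.5/12) ≈ 1.22. Using the study's monotone F0 values purely as illustrative starting points (the Shifted condition actually shifted the natural contours), a 150-Hz voice moves up to roughly 184 Hz, and a 230-Hz voice down to roughly 188 Hz:

```python
SEMITONES = 3.5
ratio = 2 ** (SEMITONES / 12)      # frequency ratio per 3.5 semitones, ~1.22

male_f0, female_f0 = 150.0, 230.0  # monotone F0 values used in the study, Hz
shifted_male = male_f0 * ratio     # male talker shifted up
shifted_female = female_f0 / ratio # female talker shifted down
```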
An alternative approach to assessing the relative contri-
butions of spectral dynamics and features unaffected by spec-
tral rotation involved eliminating spectral variation while
preserving amplitude envelope and pitch and periodicity in-
formation. This manipulation, previously used by Faulkner
and Rosen (1999), was implemented in real time in Aladdin
by the use of a second input signal comprising pulses occur-
ring once per pitch period. This pulse input triggered the gen-
eration of a pulse train carrier within the DSP system which
was then modulated by an amplitude envelope extracted from
the speech signal (bandpass filtered between 50 Hz and
3 kHz, 6-dB per octave roll-off). Envelope extraction
employed full-wave rectification and a 32-Hz low-pass filter
(fourth-order elliptic). During unvoiced speech segments a
white-noise carrier was used. Mixed excitation sounds (e.g.,
/z/) led to a voiced output alone. Finally, spectral rotation
was applied as described above. This condition will subse-
quently be referred to as “PP” (for “Pitch and Periodicity”).
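A minimal sketch of this envelope-and-carrier manipulation is given below. It is an offline approximation, not the Aladdin implementation: a one-pole smoother stands in for the fourth-order elliptic 32-Hz filter, the 50-Hz to 3-kHz bandpass pre-filter is omitted, and a fixed pulse period replaces the pitch-synchronous pulse input:

```python
import math
import random

FS = 22050  # sampling rate, Hz

def envelope(x, cutoff=32.0, fs=FS):
    """Full-wave rectification followed by smoothing.

    A one-pole low-pass stands in for the paper's fourth-order elliptic
    32-Hz filter (an assumption of this sketch)."""
    a = math.exp(-2 * math.pi * cutoff / fs)
    out, state = [], 0.0
    for v in x:
        state = a * state + (1 - a) * abs(v)  # rectify, then smooth
        out.append(state)
    return out

def pp_signal(x, f0=150.0, voiced=True, fs=FS):
    """Replace all spectral detail with an envelope-modulated carrier:
    a once-per-period pulse train when voiced, white noise when unvoiced."""
    env = envelope(x)
    if voiced:
        period = max(1, int(fs / f0))
        carrier = [1.0 if i % period == 0 else 0.0 for i in range(len(x))]
    else:
        carrier = [random.uniform(-1.0, 1.0) for _ in range(len(x))]
    return [e * c for e, c in zip(env, carrier)]
```

The output thus preserves amplitude envelope, periodicity, and (via the pulse rate) pitch, while carrying no spectral shape information.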
A further condition, tested only in two of the trained par-
ticipants (see Sec. II F below), assessed the possibility that
improvements in sentence recognition with training primar-
ily reflected an enhanced ability to take advantage of infor-
mation from the narrow band of frequencies centered on the
frequency around which the spectrum was rotated, which are
relatively unaffected by the transformation. Prior to spectral
rotation, sentences were subject to filtering with steep cutoffs
(Chebyshev type II) to restrict the signal to a single band
centered at 2 kHz. The bandwidth was 240 Hz, correspond-
ing to the equivalent rectangular bandwidth of the auditory
filter (Moore and Glasberg, 1983).
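The 240-Hz figure follows from the Moore and Glasberg (1983) ERB polynomial, ERB = 6.23f² + 93.39f + 28.52 Hz with f in kHz, evaluated at the 2-kHz rotation centre:

```python
def erb_hz(f_khz):
    """Moore & Glasberg (1983) equivalent rectangular bandwidth, in Hz."""
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52

# at 2 kHz this gives ~240 Hz, the bandwidth of the single analysis band
bandwidth = erb_hz(2.0)
```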
E. Training
CDT was implemented in a similar way to Rosen et al.
(1999). A single female speaker of Southern British English
(one of the authors, R. Paterson) was the talker for all four
trained participants. The materials used were books drawn
from the Heinemann Guided Readers series aimed at learn-
ers of English. The trainer and the participant were seated in
adjacent sound-proof rooms separated by a double-pane
glass partition. The trainer read phrases from the text which
the participant was required to repeat verbatim. Following
an accurate response, the trainer moved on to the next
phrase. If an error was made, the trainer repeated all or part
of the phrase, and the participant responded again. If the
phrase had not been accurately repeated after three attempts,
it was presented unprocessed before moving on to the next
phrase. Performance on the task was assessed in terms of the
rate, in words/min, at which the participant was able to cor-
rectly repeat the phrases spoken by the trainer. A low-level
pink noise was present in the participant’s room to mask any
of the trainer’s natural speech not sufficiently attenuated by
the intervening wall. Approximately half the training was
carried out with the participant able to see the speaker’s
face, while for the remainder, the glass partition between the
two rooms was covered and the participant received only
audio input. The addition of visual input was intended to
expedite learning, particularly in the early stages of training
when performance with the audio signal alone was expected
to be very poor.
F. Procedure
Trained participants completed four sessions of speech
perception testing. The first three sessions took place at
approximately weekly intervals. There was a shorter gap,
typically 3 to 4 days, between the third and fourth sessions.
They underwent a total of 6 h of CDT with spectrally-rotated
speech: 3 h between the first and second testing sessions, and
3 h between the second and third testing sessions. No train-
ing took place between the third and fourth testing sessions.
Training was completed in eight 45-min sessions, divided
into 5-min blocks which alternated between audiovisual and
audio-only presentation, with the exception that the first four
blocks in the initial training session all used audiovisual pre-
sentation. Untrained participants completed the same tests as
the trained group, but over a shorter period of time, typically
within a few days.
The first three testing sessions contained tests of the per-
ception of spectrally-rotated speech, using all the different
speech materials. In each session the 120 VCV stimuli were
presented once and the 68 vowel stimuli twice. Four BKB
sentence lists (female talker) were presented, two audiovisu-
ally and two audio-only. Two ASL sentence lists (male
talker) were presented audio-only. In addition, in the first
testing session vowel and consonant perception were
assessed with stimuli low-pass filtered at 4 kHz, but other-
wise unprocessed. No feedback was given during any of the
testing. The order in which the four different types of tests
were conducted by each participant was based on random-
ized Latin Squares. Sentence lists were chosen at random
(without replacement) for each participant.
The fourth testing session examined speech recognition
in the conditions with additional manipulations intended to
elucidate the factors underlying learning of spectrally-
rotated speech. Sentence recognition was assessed audio-
only in Shifted, Monotone, and PP conditions (one BKB and
two ASL lists in each condition).
In addition, two of the four trained participants (T1 and
T4) carried out additional testing approximately 7 weeks af-
ter the fourth testing session. Sentence recognition was
tested both for spectrally-rotated speech as previously expe-
rienced, to assess retention of learning, and for speech fil-
tered into a single spectral band around 2 kHz. Prior to
testing they were given a brief “reminder” CDT session
comprising one 5-min block of audiovisual training followed
by four 5-min blocks of audio-only training.
III. RESULTS
A. Connected discourse tracking
Figure 1 shows how performance in CDT training
changed over the eight 45-min training sessions. As would
be expected, performance was better in the audiovisual con-
dition than with audio-only presentation, reflecting the avail-
ability of speech-reading cues. In the audiovisual condition,
performance reached a plateau around the fourth training
session at approximately 80 words/min. This is some way
short of the maximum rate previously found with normal
speech and full visual and acoustic cues of between 110 and
130 words/min, depending on the complexity of the training
test (De Filippo and Scott, 1978; Hochberg et al., 1989).
This suggests that although tracking rates ceased to improve
over the final few training sessions, adaptation was not com-
plete. In the audio-only condition performance continued to
improve until the final one or two sessions. The lack of
improvement in the final session may partly reflect the fact
that, during this session, three of the four participants
reached the end of the book that had been used since the start
of training and therefore had to adjust to new material.
Linear regressions on words correct per minute against
training session in audio-only conditions showed that there
was a highly significant increase in performance across
session (p < 0.001 for all four participants). For three of the
four participants the regression slopes were similar, corre-
sponding to an increase of between 15 and 20 words/min for
each hour of training. The remaining participant (T3) learned
considerably more slowly with a regression slope showing
an increase of only 4 words/min for every hour of training.
B. Consonants
Consonant identification data is shown in Fig. 2.
Performance was near ceiling with stimuli which were
unprocessed beyond low-pass filtering at 4 kHz. For
spectrally-rotated stimuli, performance was very low in all
three testing sessions for both groups with both talkers, with
no evidence of learning for the trained group. Here, and sub-
sequently, data were analyzed using mixed effects linear
modeling. This enables the amount of training to be treated
as a continuous variable and also allows incorporation of
scores from multiple lists in the same condition. Data from
the spectrally-rotated stimuli were analyzed with factors of
talker (male or female), group (trained or control), and test
session number. No significant main effects or interactions
were observed {main effect of talker [F(1,40) = 1.64, p = 0.208], interaction between talker and session number [F(1,40) = 1.17, p = 0.285], all other F's < 1}.
C. Vowels
As shown in Fig. 3, vowel identification data showed a
similar pattern to consonant identification: Near-ceiling per-
formance on stimuli that were merely low-pass filtered at
4 kHz, very low performance in all tests with spectrally-
rotated stimuli, and no evidence of learning. An analysis
using a mixed effects linear model with factors of group,
talker, and session number showed no significant main
effects or interactions {main effect of group [F(1,88) = 1.07, p = 0.303], all other F's < 1}.
D. Sentences
1. Spectral rotation only
Recognition of spectrally-rotated sentences with audio-
only presentation over the first three test sessions will be
examined first (Fig. 4). In contrast to the vowel and conso-
nant data, there was clear evidence of learning. In the first
testing session performance was very low for both trained
and untrained groups. In subsequent testing sessions, how-
ever, while performance remained poor for the untrained
group, it steadily increased for the trained group. Increased
sentence recognition with training was apparent with both
test talkers, but was particularly pronounced for the female
talker.
All four trained participants showed improvements over
the different test sessions although there was considerable
variability in the extent of improvement. Consistent with the
CDT data the least improvement was shown by T3, for
whom the mean proportion of key words correct for the female talker increased from 0.01 to 0.11 over the three test sessions. For the other three participants the increase in proportion correct for the female talker ranged between 0.26 and 0.64. Benefits from training were somewhat more consistent for the male talker, with increases in proportion correct for all four listeners ranging between 0.09 and 0.22.
FIG. 1. Boxplots of performance over the eight training sessions of CDT with spectrally-rotated speech. Mean words correct per minute, averaged across the 5-min segments within each training session, are shown for both audiovisual and audio-only presentation.
FIG. 2. Boxplots of consonant identification for trained and control groups for the male and female talkers (top and bottom panels, respectively). The two leftmost boxes show performance with low-pass filtered (LP) but otherwise unprocessed materials obtained in the first testing session. The remaining boxes show performance with spectrally-rotated materials obtained over the first three testing sessions.
Audio-only data from the first three test sessions were
analyzed with a mixed effects linear model with factors of
talker, session, and group. The three-way interaction was
close to significance [F(1,87) = 3.17, p = 0.078], as was the two-way interaction between talker and session [F(1,87) = 3.42, p = 0.068]. The two-way interaction between talker
and group was not significant [F < 1]. Most importantly, there
was a highly significant two-way interaction between session
and group [F(1,87) = 19.67, p < 0.001], reflecting the fact
that performance did not change for the untrained group, but
increased over time for the trained group. The main effect of
talker was not significant [F(1,87) = 2.69, p = 0.105]. The
significant interaction between group and session means that
analysis of their main effects is of little consequence, but for
completeness we observe that there was no significant effect
of group [F(1,87) = 2.92, p = 0.091], but a highly significant effect of session [F(1,87) = 23.93, p < 0.001].
With audiovisual presentation (female talker only) there
was also evidence of learning, although there was very large
variability in performance levels before training and ceiling
effects occurred for the trained group. Despite these limita-
tions, a mixed effects linear analysis showed evidence of
learning in the form of a significant interaction between ses-
sion and group [F(1,44) = 5.65, p = 0.022]. Data from audiovisual conditions will not be considered further.
2. Spectral rotation with additional signal
manipulations
As shown in Fig. 4, performance was also better for the
trained than the untrained group in the Monotone and
Shifted conditions. For the male talker there was little differ-
ence between performance in these conditions and that in the
third test session with spectrally-rotated speech, conducted
after 6 h of training. For the female talker the Monotone con-
dition did produce poorer performance for the trained group
than that in the third test session with spectrally-rotated
speech. However, for three of the four participants a very
similar decrement in performance was also apparent in the
Shifted condition. Since the intonation information con-
tained in the Shifted condition is very similar to that in the
speech subjected only to spectral rotation, it is likely that the
decrement in the Monotone condition was primarily attribut-
able to artifacts of the voice pitch manipulation process,
rather than to the absence of intonation information.
Data from the Monotone and Shifted conditions and
from the third test session with spectrally-rotated speech
were submitted to a mixed effects linear analysis with
factors of group, talker, and condition. There was a highly
significant effect of group [F(1,68) = 37.09, p < 0.001],
but no other main effects or interactions were significant
{three-way interaction [F(1,68) = 1.56, p = 0.218], interaction
between group and talker [F(1,68) = 2.40, p = 0.126],
interaction between group and condition [F(1,68) = 1.17,
p = 0.318], all other F's < 1}.

FIG. 3. Boxplots of vowel identification for trained and control groups for
the male and female talkers (top and bottom panels, respectively). The two
leftmost boxes show performance with low-pass filtered but otherwise
unprocessed materials obtained in the first testing session (LP). The remaining
boxes show performance with spectrally-rotated materials obtained over
the first three testing sessions.

FIG. 4. Boxplots of audio-only sentence recognition for the male and female
talkers. Each panel shows (from left to right) performance with spectrally-rotated
sentences in the first three testing sessions, and with pitch-shifted
and monotone sentences from the fourth testing session.

1374 J. Acoust. Soc. Am., Vol. 134, No. 2, August 2013 Green et al.: Adaptation to spectrally-rotated speech
The elimination of spectral variation implemented in the
PP condition led to floor effects for both trained and
untrained groups. Across the two groups, 21 out of a total of
24 runs produced zero key words correct.
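The floor-level performance in the PP condition is unsurprising given how little information survives once spectral variation is removed. As a rough illustration only (not the processing actually used in this study, which additionally preserved pitch and periodicity cues), a signal can be reduced to its broadband amplitude envelope imposed on a spectrally flat carrier:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_carrier(x, fs, cutoff=30.0, rng=None):
    """Replace a signal with its amplitude envelope on a noise carrier.

    Simplified sketch of eliminating spectral variation: only the
    broadband envelope survives (the study's PP condition additionally
    preserved pitch and periodicity, which this sketch omits).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    env = np.abs(hilbert(x))                          # instantaneous envelope
    sos = butter(4, cutoff / (fs / 2.0), btype="low", output="sos")
    env = np.maximum(sosfiltfilt(sos, env), 0.0)      # smooth, clip negatives
    carrier = rng.standard_normal(len(x))             # spectrally flat carrier
    return env * carrier

# Demonstration: a tone followed by silence keeps only its on/off envelope.
fs = 16000
t = np.arange(fs // 2) / fs
x = np.concatenate([np.sin(2.0 * np.pi * 500.0 * t), np.zeros(fs // 2)])
y = envelope_carrier(x, fs)
```

With all spectral detail replaced by noise, only gross temporal structure remains, consistent with the near-zero keyword scores observed.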
Figure 5 shows audio-only sentence recognition for the
two trained participants who were available for further testing
approximately 7 weeks after completing the main experiment.
Their performance in the third testing session is shown
for comparison, and it is clear that there was no decrement in
performance over the intervening period. Figure 5 also shows
that performance was at floor when signals were restricted to
a single spectral band centered at 2 kHz.
IV. DISCUSSION
Audio-only sentence recognition improved for the
trained group but not for the untrained group. This confirms
that adaptation to spectrally-rotated speech is possible and
shows that the improvement was not attributable to repeated
exposure to the test procedures. Improved sentence recogni-
tion after training was observed for two different talkers,
both of whom differed from the training talker, indicating
generalization of learning across talkers. Recognition of
spectrally-rotated sentences in two subjects available for
follow-up testing did not differ from that obtained immedi-
ately after training, suggesting that learning was robust over
a period of several weeks. Data obtained with additional
stimulus manipulations, discussed in more detail below,
were consistent with the idea that improvements with
training did not simply reflect better use of those, primarily
temporal, speech features that are relatively well preserved
after spectral rotation. These findings demonstrate that quite
rapid adaptation is possible to a radical transformation of the
representation of critical speech spectral information.
Spectral rotation is clearly not directly comparable to the
transformations that might be experienced by users of audi-
tory prostheses. However, the present results do suggest that non-monotonic
transformations of spectral information, such as
might be experienced particularly by ABI users, do not, per
se, preclude the regaining of substantial levels of speech
recognition.
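For readers unfamiliar with the transformation, a standard way to implement spectral rotation about a center frequency is ring modulation of low-pass filtered speech by a carrier at twice that frequency, which mirrors each component at f Hz to (2·fc − f) Hz. The sketch below illustrates the principle only; the exact filter designs used in the present study may differ.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def spectrally_rotate(x, fs, center=2000.0):
    """Mirror the 0..2*center band of `x` about `center` Hz.

    Multiplying by a cosine at 2*center maps a component at f Hz
    to 2*center - f Hz; low-pass filtering before and after removes
    energy outside the rotated band.
    """
    band = 2.0 * center
    sos = butter(8, band / (fs / 2.0), btype="low", output="sos")
    x_lp = sosfiltfilt(sos, x)                          # keep 0..4 kHz only
    t = np.arange(len(x)) / fs
    mirrored = x_lp * np.cos(2.0 * np.pi * band * t)    # images at band -/+ f
    return 2.0 * sosfiltfilt(sos, mirrored)             # drop the upper image

# A 1 kHz tone should emerge near 3 kHz after rotation about 2 kHz.
fs = 16000
t = np.arange(fs) / fs
rotated = spectrally_rotate(np.sin(2.0 * np.pi * 1000.0 * t), fs)
peak_hz = np.argmax(np.abs(np.fft.rfft(rotated))) * fs / len(rotated)
```

Because the mapping is a mirror rather than a shift, spectral shape and formant trajectories are inverted while broad temporal properties survive, which is exactly what makes the transformation so disruptive initially.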
It was conceivable that improvements after training
might involve participants learning to make better use of
speech information that was relatively unaffected by the
transformation, such as periodicity and envelope, while
ignoring distorted spectral information. However, this was
not supported by testing with stimulus manipulations that,
prior to spectral rotation, selectively eliminated or preserved
different aspects of speech information. Sentence recogni-
tion for the two re-tested participants was at floor for stimuli
restricted to the central frequency band around 2 kHz that
was largely unaffected by spectral rotation, showing that
they had not learned simply to ignore information from
transformed spectral regions. Performance was similarly
poor for stimuli in which spectral variation was eliminated,
while pitch, periodicity, and amplitude envelope were pre-
served. This suggests that post-training improvements in per-
ception of spectrally-rotated speech did reflect adaptation to
altered spectral information and is consistent with the notion
that access to spectral dynamics is critical to speech under-
standing (Rosen and Iverson, 2007). In addition, sentence
recognition was not significantly poorer when natural F0
contours were replaced with monotones. Therefore, the pres-
ervation of intonation contour shape does not appear to be
critical to the comprehension of spectrally-rotated speech af-
ter training. It remains possible, however, that the presence
of near natural intonation patterns may be important during
the learning process through, for example, providing helpful
cues to segmentation and syntactic structure.
Somewhat surprisingly, despite the improvements in
CDT rates and sentence recognition with training, there were
no significant improvements in medial vowel or intervocalic
consonant identification, suggesting a particularly important
role for contextual information (Boothroyd and Nittrouer,
1988). One possibility is that the multiple constraints on
word choices provided by simple, predictable sentences
make these materials sensitive to improvements in the per-
ception of speech features that are too small to be manifested
in tests of isolated phoneme identification. However, it
should be noted that this aspect of our data contrasts with
Blesser (1969, 1972), where there were improvements in
both vowel and consonant recognition. Methodological fac-
tors might be important here. For example, Blesser’s partici-
pants completed tests of vowel and consonant discrimination
prior to identification tests. It should, however, also be noted
that the improvements in vowel or consonant identification
observed by Blesser were quite small and that there was little
correlation with improvements in sentence recognition.
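The intuition that contextual constraints can amplify small perceptual gains can be made concrete with Boothroyd and Nittrouer's (1988) k-factor model, in which context raises a recognition probability p to 1 − (1 − p)^k, with k > 1. The values below are purely illustrative, not fitted to the present data:

```python
def with_context(p, k):
    """Boothroyd & Nittrouer (1988): context raises p to 1 - (1 - p)**k."""
    return 1.0 - (1.0 - p) ** k

# Illustrative values only: a modest k and a small phoneme-level gain.
k = 3.0
p_pre, p_post = 0.30, 0.34
gain_phoneme = p_post - p_pre                               # 0.04
gain_context = with_context(p_post, k) - with_context(p_pre, k)
# The same underlying improvement yields a larger gain when context
# multiplies the number of independent routes to recognition.
```

Under such a model, a change too small to reach significance in isolated phoneme identification could still produce a measurable improvement in sentence keyword scores.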
FIG. 5. Boxplots of audio-only sentence recognition for the male and female
talkers from the additional testing carried out with two trained participants,
7 weeks after the end of the initial testing. This included simple spectral
rotation and also conditions in which stimuli were restricted to a single fre-
quency band centered at 2 kHz. For comparison the performance of these
two participants with spectrally-rotated sentences in the third testing session
is shown on the left-hand side of the figure.
Studies of the learning of spectrally-shifted noise- or
tone-vocoded speech have also produced an inconsistent pic-
ture regarding the relationship between sentence recognition
and phoneme identification, measured using VCV and bVd
or hVd materials. Using CDT training Rosen et al. (1999)
found large improvements in recognition of BKB sentences
and also significant improvements in vowel and consonant
identification. Fu et al. (2005) implemented an interactive,
computer-based training method using unconnected HINT
sentences (Nilsson et al., 1994) and found significant
improvements for consonant, but not vowel, identification.
Unfortunately, Fu et al. (2005) did not include sentence tests,
but large increases in sentence recognition during training
were reported. Using a similar training method Stacey and
Summerfield (2008) found significant improvements in rec-
ognition of BKB and IEEE sentences, but no significant
effect of training on either vowel or consonant identification.
Sentence recognition, of course, involves a wider range
of cognitive skills operating on lexical, semantic, and syntac-
tic information, which introduces more variability across
participants. It may also be relevant that the two aforemen-
tioned studies that showed significant improvement for both
vowel and consonant identification (Blesser, 1972; Rosen
et al., 1999) used a single talker for each type of test material
whereas the remainder, including the present study, used
multiple talkers within vowel and consonant tests. It may be
that trial-to-trial variation of talkers depresses performance
in tests of brief isolated phonemes which provide very little
time to adjust to a different talker.
The finding that learning generalized across talkers
indicates that adaptation was occurring at a level of abstrac-
tion beyond the particular acoustic-phonetic patterns pro-
duced by the training talker. Generalization to new talkers
would likely be increased by using a number of different
talkers during training (Stacey and Summerfield, 2007).
Further research might explore which aspects of speech
processing are being modified during learning of spectrally-
rotated speech. In the context of vocoded or synthetic
speech, there has been considerable research examining the
transfer of perceptual learning of speech to contexts that
were not experienced in training, providing information
about the levels of processing at which training related
changes occur (e.g., Francis et al., 2007; Dahan and Mead,
2010; Hervais-Adelman et al., 2011). Such work suggests
that while there is more or less complete generalization
across some acoustic features, there are also context-
dependent aspects of learning. For example, Hervais-
Adelman et al. (2011) showed that learning of vocoded
speech transferred to an untrained frequency region, but that
learning only partly generalized across different carrier sig-
nals used in vocoding. Dahan and Mead (2010) found that
after training with noise-vocoded monosyllables, perception
of consonants in test stimuli differed according to whether
they appeared in the same position, or flanked by the same
vowel, as in the training stimuli. There is also evidence that
lexical information plays an important role in learning of
vocoded speech (Stacey and Summerfield, 2008), but that
semantic context is not essential for learning (Davis et al.,
2005; Loebach et al., 2010).
It would appear reasonable to expect a considerable
overlap between the processes involved in adapting to
spectrally-shifted vocoded speech and spectrally-rotated
speech. However, the transformations do differ substantially,
e.g., in the contrasting extent to which information about
intonation and spectral dynamics is preserved, and it remains
possible that there may be aspects of learning that are spe-
cific to spectral rotation. Other unresolved issues include the
extent to which further improvements in speech perception
might be possible with long-term experience with extreme
spectral transformations, and the degree to which short-term
adaptation might occur with passive exposure to spectral
rotation rather than specific training. However, it does
appear that some caution may be necessary in the treatment
of spectrally-rotated speech as a non-speech control in neu-
roimaging research, in particular when relatively long expo-
sure periods are used.
V. CONCLUSIONS
Considerable adaptation to an extreme form of distortion
of spectral information was possible with a few hours'
experience. Learning did appear to involve adaptation to
altered spectral shape and dynamics. The fact that intona-
tional contrasts are well preserved after the transformation
did not appear to be important for the comprehension of
spectrally-rotated speech after training, though it remains
possible that intonation might contribute to the process of
adaptation.
ACKNOWLEDGMENT
This work was partially supported by Action on Hearing
Loss (Grant No. G53).
Azadpour, M., and Balaban, E. (2008). “Phonological representations are
unconsciously used when processing complex, non-speech signals,” PLoS
ONE 3, e1966.
Bench, J., Kowal, A., and Bamford, J. (1979). “The BKB (Bamford-Kowal-
Bench) sentence lists for partially-hearing children,” Br. J. Audiol. 13,
108–112.
Blesser, B. (1969). “Perception of spectrally rotated speech,” Ph.D. disserta-
tion, Massachusetts Institute of Technology, Cambridge, MA.
Blesser, B. (1972). “Speech perception under conditions of spectral transfor-
mation. 1. Phonetic characteristics,” J. Speech Hear. Res. 15, 5–41.
Boersma, P., and Weenink, D. (2001). “Praat: Doing phonetics by computer
(version 3.9.28) [computer program],” http://www.praat.org (Last viewed
May 2001).
Boothroyd, A., and Nittrouer, S. (1988). “Mathematical treatment of context
effects in phoneme and word recognition,” J. Acoust. Soc. Am. 84,
101–114.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R.,
Hagerman, B., Hetu, R., Kei, J., Lui, C., Kiessling, J., Kotby, M. N.,
Nasser, N. H. A., El Kholy, W. A. H., Nakanishi, Y., Oyer, H., Powell, R.,
Stephens, D., Meredith, R., Sirimanna, T., Tavartkiladze, G., Frolenkov,
G. I., Westerman, S., and Ludvigsen, C. (1994). “An international compar-
ison of long-term average speech spectra,” J. Acoust. Soc. Am. 96,
2108–2120.
Colletti, L., Shannon, R., and Colletti, V. (2012). “Auditory brainstem
implants for neurofibromatosis type 2,” Curr. Opin. Otolaryngol. Head
Neck Surg. 20, 353–357.
Dahan, D., and Mead, R. L. (2010). “Context-conditioned generalization in
adaptation to distorted speech,” J. Exp. Psychol. Hum. Percept. Perform.
36, 704–728.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., and
McGettigan, C. (2005). “Lexical information drives perceptual learning of
distorted speech: Evidence from the comprehension of noise-vocoded
sentences,” J. Exp. Psychol. Gen. 134, 222–241.
De Filippo, C., and Scott, B. L. (1978). “Method for training and evaluating
reception of ongoing speech,” J. Acoust. Soc. Am. 63, 1186–1192.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997a). “Speech intelligibil-
ity as a function of the number of channels of stimulation for signal pro-
cessors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am.
102, 2403–2411.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997b). “Simulating the
effect of cochlear-implant electrode insertion depth on speech under-
standing,” J. Acoust. Soc. Am. 102, 2993–2996.
Dupoux, E., and Green, K. (1997). “Perceptual adjustment to highly com-
pressed speech: Effects of talker and rate changes,” J. Exp. Psychol. Hum.
Percept. Perform. 23, 914–927.
Faulkner, A., and Rosen, S. (1999). “Contributions of temporal encodings of
voicing, voicelessness, fundamental frequency, and amplitude variation to
audio-visual and auditory speech perception,” J. Acoust. Soc. Am. 106,
2063–2073.
Faulkner, A., Rosen, S., and Norman, C. (2006). “The right information may
matter more than frequency-place alignment: Simulations of frequency-
aligned and upward shifting cochlear implant processors for a shallow
electrode array insertion,” Ear Hear. 27, 139–152.
Finley, C. C., Holden, T. A., Holden, L. K., Whiting, B. R., Chole, R. A.,
Neely, G. J., Hullar, T. E., and Skinner, M. W. (2008). “Role of electrode
placement as a contributor to variability in cochlear implant outcomes,”
Otol. Neurotol. 29, 920–928.
Francis, A. L., Nusbaum, H. C., and Fenn, K. (2007). “Effects of training on
the acoustic-phonetic representation of synthetic speech,” J. Speech Lang.
Hear. Res. 50, 1445–1465.
Fu, Q.-J., and Galvin, J. J. (2003). “The effects of short-term training for
spectrally mismatched noise-band speech,” J. Acoust. Soc. Am. 113,
1065–1072.
Fu, Q.-J., Nogaki, G., and Galvin, J. J. (2005). “Auditory training with
spectrally-shifted speech: Implications for cochlear implant patient audi-
tory rehabilitation,” J. Assoc. Res. Otolaryngol. 6, 180–189.
Green, T., Faulkner, A., and Rosen, S. (2002). “Spectral and temporal cues
to pitch in noise-excited vocoder simulations of continuous-interleaved-
sampling cochlear implants,” J. Acoust. Soc. Am. 112, 2155–2164.
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., and Carlyon, R. P.
(2008). “Perceptual learning of noise vocoded words: Effects of feed-
back and lexicality,” J. Exp. Psychol. Hum. Percept. Perform. 34,
460–474.
Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., and
Carlyon, R. P. (2011). “Generalization of perceptual learning of vocoded
speech,” J. Exp. Psychol. Hum. Percept. Perform. 37, 283–295.
Hill, J., McRae, P., and McClellan, R. (1968). “Speech recognition as a
function of channel capacity in a discrete set of channels,” J. Acoust. Soc.
Am. 44, 13–18.
Hochberg, I., Rosen, S., and Ball, V. (1989). “Effect of text complexity on
connected discourse tracking rate,” Ear Hear. 10, 192–199.
Ketten, D. R., Vannier, M. W., Skinner, M. W., Gates, G. A., Wang, G., and
Neely, J. G. (1998). “In vivo measures of cochlear length and insertion
depth of nucleus cochlear implant electrode arrays,” Ann. Otol. Rhin.
Laryngol. 107, 1–16.
Loebach, J. L., Pisoni, D. B., and Svirsky, M. A. (2010). “Effects of seman-
tic context and feedback on perceptual learning of speech processed
through an acoustic simulation of a cochlear implant,” J. Exp. Psychol.
Hum. Percept. Perform. 36, 224–234.
MacLeod, A., and Summerfield, A. Q. (1990). “A procedure for measuring
auditory and audio-visual speech-reception thresholds for sentences in
noise: Rationale, evaluation, and recommendations for use,” Br. J. Audiol.
24, 29–43.
Moore, B. C. J., and Glasberg, B. R. (1983). “Suggested formulae for calcu-
lating auditory filter bandwidths and excitation patterns,” J. Acoust. Soc.
Am. 74, 750–753.
Moulines, E., and Charpentier, F. (1990). “Pitch-synchronous waveform
processing techniques for text-to-speech synthesis using diphones,”
Speech Commun. 9, 453–467.
Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the
hearing in noise test for the measurement of speech reception thresholds in
quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099.
Pallier, C., Sebastián-Gallés, N., Dupoux, E., Christophe, A., and Mehler, J.
(1998). "Perceptual adjustment to time-compressed speech: A cross-
linguistic study," Mem. Cognit. 26, 844–851.
Plomp, R. (1967). “Pitch of complex tones,” J. Acoust. Soc. Am. 41,
1526–1533.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981). “Speech
perception without traditional speech cues,” Science 212, 947–950.
Rosen, S., Faulkner, A., and Wilkinson, L. (1999). “Adaptation by normal
listeners to upward spectral shifts of speech: Implications for cochlear
implants,” J. Acoust. Soc. Am. 106, 3629–3636.
Rosen, S., and Iverson, P. (2007). “Constructing adequate non-speech ana-
logues: what is special about speech anyway?” Dev. Sci. 10, 165–168.
Scott, S. K., Blank, C. C., Rosen, S., and Wise, R. J. S. (2000).
“Identification of a pathway for intelligible speech in the left temporal
lobe,” Brain 123, 2400–2406.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M.
(1995). “Speech recognition with primarily temporal cues,” Science 270,
303–304.
Shannon, R. V., Zeng, F.-G., and Wygonski, J. (1998). "Speech recognition
with altered spectral distribution of envelope cues," J. Acoust. Soc. Am.
104, 2467–2476.
Smith, M. W., and Faulkner, A. (2006). “Perceptual adaptation by normally
hearing listeners to a simulated ‘hole’ in hearing,” J. Acoust. Soc. Am.
120, 4019–4030.
Souza, P., and Rosen, S. (2009). “Effects of envelope bandwidth on the
intelligibility of sine- and noise-vocoded speech,” J. Acoust. Soc. Am.
126, 792–805.
Stacey, P. C., and Summerfield, A. Q. (2007). “Effectiveness of computer-
based auditory training in improving the perception of noise-vocoded
speech,” J. Acoust. Soc. Am. 121, 2923–2935.
Stacey, P. C., and Summerfield, A. Q. (2008). “Comparison of word-, sen-
tence-, and phoneme-based training strategies in improving the perception
of spectrally distorted speech,” J. Speech Lang. Hear. Res. 51, 526–538.