Adaptation to spectrally-rotated speech
Tim Green, Stuart Rosen, Andrew Faulkner, and Ruth Paterson
Speech, Hearing, and Phonetic Sciences, UCL, Chandler House, 2, Wakefield Street, London, WC1N 1PF,
United Kingdom
(Received 20 December 2012; revised 28 May 2013; accepted 10 June 2013)
Much recent interest surrounds listeners’ abilities to adapt to various transformations that distort
speech. An extreme example is spectral rotation, in which the spectrum of low-pass filtered speech
is inverted around a center frequency (2 kHz here). Spectral shape and its dynamics are completely
altered, rendering speech virtually unintelligible initially. However, intonation, rhythm, and con-
trasts in periodicity and aperiodicity are largely unaffected. Four normal hearing adults underwent
6 h of training with spectrally-rotated speech using Connected Discourse Tracking. They and an
untrained control group completed pre- and post-training speech perception tests, for which talkers
differed from the training talker. Significantly improved recognition of spectrally-rotated sentences
was observed for trained, but not untrained, participants. However, there were no significant
improvements in the identification of medial vowels in /bVd/ syllables or intervocalic consonants.
Additional tests were performed with speech materials manipulated so as to isolate the contribution
of various speech features. These showed that preserving intonational contrasts did not contribute
to the comprehension of spectrally-rotated speech after training, and suggested that improvements
involved adaptation to altered spectral shape and dynamics, rather than just learning to focus on
speech features relatively unaffected by the transformation.
© 2013 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4812759]
PACS number(s): 43.71.Sy [JMH] Pages: 1369–1377
I. INTRODUCTION
Listeners possess considerable abilities to adapt to trans-
formations which, to various extents, degrade and distort im-
portant features of speech signals. Examples include noise-
or tone-excited vocoding (Hill et al., 1968; Shannon et al.,
1995; Dorman et al., 1997a), sine-wave speech (Remez
et al., 1981), time-compressed speech (Dupoux and Green,
1997), and spectral rotation (Blesser, 1972). The speed
and degree of adaptation vary across transformations.
Investigation of factors that contribute to adaptation and its
limitations may provide valuable insights into perceptual
learning processes and inform models of speech perception.
Adaptation to noise- or tone-vocoded speech has
received considerable interest, not least because this type of
processing has features in common with the processing typi-
cally applied in cochlear implant systems (Davis et al.,
2005; Hervais-Adelman et al., 2008; Hervais-Adelman
et al., 2011; Loebach et al., 2010). Spectral resolution is lim-
ited to a small number of broad frequency bands, temporal
fine structure is eliminated, but amplitude envelopes within
each frequency band are preserved. Learning of speech that
has been tone- or noise-vocoded but not subject to other dis-
tortion is very rapid. For example, using six-channel noise-
vocoded speech, Davis et al. (2005) reported that sentence
recognition improved from near zero to 70% words correct
over the course of presentation of just 30 sentences. One im-
portant finding has been that improvement after training is
seen for words that were not heard during training, suggest-
ing that learning involves modification of the processing of
phonetic cues at a sublexical level (Davis et al., 2005).
Similarly, with time-compressed speech, the degree to which
learning transfers across languages has been found to depend
on the phonological similarity between the languages
(Pallier et al., 1998), suggesting that learning occurred at the
phonetic level.
The rapidity of adaptation to vocoded speech probably
reflects the fact that, while fine spectral detail is lost, the over-
all shape and position of the spectral envelope are well pre-
served. For cochlear implant users, the representation
of speech spectral information is subject to more complex
transformations than those involved in straightforward vocod-
ing. For example, post-lingually deafened cochlear implant
users must adapt to some change of frequency to place map-
ping, since it is highly unlikely that all of the electrode con-
tacts will be at tonotopically correct places. Typically, an
overall upward spectral shift will arise due to incomplete
insertion of the electrode array (Ketten et al., 1998). Short-term studies using noise-excited vocoding in normal hearing
listeners have shown that such shifts in spectral envelope
have a highly detrimental effect on speech perception, far
beyond that imposed by vocoding per se, and largely inde-
pendent of the degree of spectral resolution (Dorman et al.,
1997b; Shannon et al., 1998). However, a few hours of training has been shown to be sufficient to lead to significant
improvements in sentence recognition for speech that has
been both noise-vocoded and spectrally-shifted (Faulkner
et al., 2006; Fu and Galvin, 2003; Rosen et al., 1999).
Similarly, Smith and Faulkner (2006) showed that adaptation
was possible to noise-vocoded speech in which the fre-
quency-to-place map was “warped.” This simulated a situa-
tion in which there is a “dead” cochlear region with no
functional neurons, and the frequency map is adjusted so as
to distribute spectral information from the whole signal over
the functioning cochlear regions on either side of the dead
region.
J. Acoust. Soc. Am. 134 (2), August 2013 © 2013 Acoustical Society of America 1369
The addition of spectral shifting or warping results in
more complex transformations of speech spectral informa-
tion than noise- or tone-vocoding alone. However, these are
still monotonic transformations that largely preserve the rel-
ative shape of the spectral envelope. Substantially greater
difficulties in adaptation might be anticipated for distortions
of speech signals that involve non-monotonic transforma-
tions of the spectral envelope. Some form of non-
monotonicity might occur in cochlear implant users due to a
range of factors that affect the extent to which current deliv-
ered to a particular electrode is effective in stimulating an
appropriate population of auditory nerve fibers, such as the
possibility of cross-turn stimulation (e.g., Finley et al.,
2008). Such distortions are even more likely in users of audi-
tory brainstem implants (ABIs), where assignment of audi-
tory filters to electrodes relies on clinical pitch ranking
procedures. The difficulties inherent in such procedures
make it more difficult to establish a tonotopic pattern of elec-
trical stimulation (Colletti et al., 2012).
An extreme example of a non-monotonic spectral trans-
formation is spectral rotation, in which the bandwidth of the
signal is first restricted by low-pass filtering and the spec-
trum is then inverted around the center frequency (Blesser,
1969, 1972; Azadpour and Balaban, 2008). Some speech
features are more or less unaffected by spectral rotation,
including amplitude envelope, the presence or absence of pe-
riodicity, and pitch variation that conveys intonation.
However, the inversion of spectral shape and dynamics
makes rotated speech completely unintelligible for naive lis-
teners. That spectrally-rotated speech retains many of the
acoustical properties of actual speech while being unintelli-
gible has led to its widespread use as a non-speech control in
neuroimaging studies of speech perception (e.g., Scott et al.,
2000).
However, it appears that considerable adaptation to this
extreme transformation is possible over a fairly short time.
Blesser (1969, 1972) provided experience with spectral rota-
tion in several 30-min sessions. Pairs of participants who
were well-known to each other heard each other’s speech
only in spectrally-rotated form and communicated purely by
auditory means, using any approach that they found practi-
cal. A range of speech tests, including vowel and consonant
perception and recognition of single words and sentences,
were carried out at various times during the course of the
experiment.
Learning was observed both for the identification of
vowels and consonants and for comprehension of whole sen-
tences, with sentence scores reaching 35% syllables correct
on average after up to 10 h experience. A few points should
be noted here, however. First, in contrast to typical contem-
porary practice, scoring was based not just on key words but
on all syllables within a sentence. Second, participants were
tested with the same sentence list on repeated occasions.
Finally, there was large variability across participants, which
may in part reflect differences in the approaches to learning
adopted by different pairs of participants. Blesser speculated
that there may be a particularly important role for intonation
in the comprehension of spectrally-rotated speech, although
this was not tested directly. Spectral rotation of voiced
speech destroys the original harmonic spectral structure
since the fundamental frequency and all its harmonics are
transposed to different frequencies. However, while the new
spectral components are no longer integer multiples of a
common fundamental frequency (F0), the spacing between
them remains equal to the original F0. This gives rise to rela-
tively strong pitch percepts which rise and fall in the same
pattern as the pitch of the original speech (Plomp, 1967).
Thus, the shape of intonation contours, and the prosodic in-
formation that they convey, are well preserved after spectral
rotation and may contribute to adaptation. This contrasts
with noise-excited vocoded speech in which voice pitch in-
formation is severely degraded or non-existent, depending
upon the details of the processing (Green et al., 2002; Souza
and Rosen, 2009).
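The arithmetic behind this preserved spacing is easily illustrated. The sketch below (plain Python; the 150-Hz fundamental is an assumed illustrative value, not taken from the paper) maps each harmonic through the rotation f → 4000 − f and checks the spacing of the resulting components:

```python
F0 = 150.0        # assumed fundamental frequency, Hz (illustrative)
LP_EDGE = 4000.0  # low-pass edge; rotation about 2 kHz maps f -> 4000 - f

# harmonics of the original voice: integer multiples of F0 below the edge
harmonics = [F0 * k for k in range(1, 27) if F0 * k < LP_EDGE]

# the same components after spectral rotation
rotated = sorted(LP_EDGE - f for f in harmonics)

# adjacent rotated components are still spaced F0 apart...
spacings = [b - a for a, b in zip(rotated, rotated[1:])]

# ...but they are no longer integer multiples of a common fundamental:
# e.g. the highest component sits at 4000 - 150 = 3850 Hz
misaligned = [f for f in rotated if (f / F0) % 1 != 0]
```

Every spacing comes out equal to F0, while no rotated component falls on an integer multiple of F0, consistent with the residue-pitch account of Plomp (1967).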
Blesser’s work would appear to suggest that a substan-
tial degree of adaptation to an extreme distortion of spectral
shape and dynamics is possible. This would represent a strik-
ing example of plasticity in the perception of a fundamental
acoustic property essential for speech understanding. Such
plasticity may be of relevance to some users of auditory
prostheses, who may experience severe spectral transforma-
tions, albeit not as extreme as spectral rotation. It may also
have implications in relation to the use of spectral rotation as
a non-speech control in imaging studies. However, the
uncontrolled nature of Blesser’s procedures makes it difficult
to be sure of the underlying processes. For example, it is not
clear to what extent improvements in sentence recognition
over the course of Blesser’s experiment were attributable
specifically to adaptation to altered spectral dynamics, rather
than learning to make use of relatively well preserved fea-
tures of transformed speech, or to increasing familiarity with
test materials and procedures. The latter point is important
since, using spectrally-shifted, noise-vocoded speech,
Stacey
and Summerfield (2008) showed that substantial improve-
ments in sentence recognition and phoneme identification
occurred due to repeated testing without any intervening
training.
Here, we investigate adaptation to spectrally-rotated
speech in more controlled conditions than those used by
Blesser (1969, 1972). We employed Connected Discourse
Tracking (CDT) (De Filippo and Scott, 1978), a training
method previously found effective for spectrally-shifted
speech (Rosen et al., 1999). Regular tests of phoneme and
sentence recognition probed the course of learning and test
conditions were included in which stimuli were manipulated
so as to assess the contribution of features largely unaffected
by spectral rotation, such as the shape of intonation contours
and the presence or absence of periodicity. Based on
Blesser’s findings it is hypothesized that participants who
receive training will show significantly larger improvements
in perception of spectrally-rotated consonants, vowels, and
sentences than participants who receive no training. In addi-
tion, if learning does not involve adaptation to altered spec-
tral shape and dynamics, but merely reflects enhanced use of
speech features preserved by spectral rotation, benefits of
training would still be expected for stimuli in which ampli-
tude envelope, pitch, and periodicity information is pre-
served, while spectral variation is eliminated. For the
particular feature of intonation, conversely, performance for
the trained group would be significantly reduced for stimuli
processed so as to eliminate intonation cues.
II. METHODS
A. Participants
Eight normal hearing adults participated after giving
informed consent. All were aged between 20 and 30, and 5
were male. Four of the participants (T1–T4) received train-
ing with spectrally-rotated speech, while the remaining four
did not. One of the untrained participants was bilingual in
English and Gujarati, while the remainder had English as
their sole native language.
B. Speech tests
1. Consonants
Recordings of 20 consonants [m n w r l j b p d t g k tʃ ʃ dʒ s f ð z v], in VCV format, spoken by 1 male and 1 female
speaker of Southern Standard British English were available.
Three different vowel contexts were used (/i/, /u/, and /ɑ/),
resulting in a total of 120 stimuli. Participants responded
using a mouse to click on 1 of 20 orthographically-labeled
buttons displayed on a computer screen.
2. Vowels
Recordings of 17 vowels in /bVd/ context, spoken by
the same male and female speakers of Southern Standard
British English were available. There were ten monophthongs [/æ/ (bad), /ɑː/ (bard), /iː/ (bead), /ɛ/ (bed), /ɪ/ (bid), /ɜː/ (bird), /ɒ/ (bod), /ɔː/ (board), /uː/ (booed), /ʌ/ (bud)] and seven diphthongs [/eə/ (bared), /eɪ/ (bayed), /ɪə/ (beard), /aɪ/ (bide), /əʊ/ (bode), /aʊ/ (boughed), /ɔɪ/ (Boyd)].
Two tokens from each speaker for each vowel were used,
giving a total of 68 stimuli. Participants responded by click-
ing on 1 of 17 on-screen buttons orthographically labeled
with the full words.
3. Sentences
Two sets of sentence materials were used. One com-
prised video recordings of Bamford-Kowal-Bench (BKB)
sentences (Bench et al., 1979), read by a female speaker of
Southern Standard British English. In some conditions these
were presented audiovisually, while in others only the sound
was presented. The other consisted of audio-only recordings
of the Adaptive Sentence List (ASL) sentences (MacLeod
and Summerfield, 1990) read by a male speaker of Southern
Standard British English. Like BKB sentences, the ASL
materials are short, highly predictable sentences, e.g., “The
bag was very heavy.” The speakers were different from those
who produced the consonant and vowel materials. Twenty-
one BKB lists, each containing 16 sentences and 50 key-
words, and 18 ASL lists, each containing 15 sentences and
45 key words, were used. After a sentence was presented,
participants repeated whatever words they thought they had
heard. The experimenter then recorded the number of key
words correct, applying a loose scoring method in which a
response was scored as correct if its root matched that of the
key word.
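Loose keyword scoring was applied by the experimenter by ear. Purely as an illustration of the rule, a crude automatic approximation might strip a few common English suffixes before comparing; the suffix list and the `loose_match` helper below are inventions of this sketch, not part of the study's procedure:

```python
def _root(word):
    """Crude root extraction: strip a few common suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def loose_match(response_word, key_word):
    """Score a response as correct if its root matches that of the key word."""
    return _root(response_word) == _root(key_word)
```

Under such a rule, a response of "bags" would be credited against the key word "bag".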
C. Signal processing and equipment
Participants listened via Sennheiser headphones (HD 25
SP). Spectral rotation was performed in real time using
the software system Aladdin (Hitech AB, Sweden) and a
digital-signal-processing PC card (Loughborough Sound
Images TMS320C31) running at a sampling rate of
22.05 kHz. Input speech was first low-pass filtered at 4 kHz
using a tenth-order elliptic filter. Additional filtering (33-
point finite impulse response) was applied in order to mini-
mize differences in the long-term spectra of rotated and nor-
mal speech. The design of this additional filter was based
largely on published measurements of the long-term average
speech spectrum (Byrne et al., 1994), although the roll-off
below 120 Hz was ignored, and a flat spectrum below
420 Hz assumed. Spectral rotation around 2 kHz was then
implemented via modulation with a 4-kHz sinusoid. In order
to remove upper frequency side bands, the modulated signal
was low-pass filtered again with the same elliptic filter. The
total root-mean-square level of the spectrally-rotated signal
was set equal to that of the original low-pass filtered signal.
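As a rough offline illustration of this chain (not the real-time Aladdin/DSP implementation used in the study: a windowed-sinc FIR stands in for the tenth-order elliptic filter, and the long-term-spectrum matching filter is omitted), the core steps are low-pass filtering, ring modulation with a 4-kHz sinusoid, low-pass filtering again to remove the upper sideband, and RMS matching:

```python
import math

FS = 22050  # sampling rate used in the study, Hz

def lowpass(x, cutoff=4000.0, fs=FS, taps=101):
    """Windowed-sinc FIR low-pass (a stand-in for the paper's elliptic filter)."""
    fc = cutoff / fs
    m = taps - 1
    h = []
    for n in range(taps):
        k = n - m / 2.0
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)  # Hamming window
        h.append(ideal * window)
    return [sum(h[j] * x[i - j] for j in range(taps) if 0 <= i - j < len(x))
            for i in range(len(x))]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def spectrally_rotate(x, fs=FS):
    lp = lowpass(x)
    # modulation by a 4-kHz sinusoid creates sidebands at 4000 - f and 4000 + f
    mod = [2.0 * v * math.cos(2 * math.pi * 4000.0 * i / fs)
           for i, v in enumerate(lp)]
    out = lowpass(mod)  # remove the upper (4000 + f) sideband
    gain = rms(lp) / max(rms(out), 1e-12)  # match level to the low-passed input
    return [gain * v for v in out]
```

A 500-Hz input component, for instance, emerges near 3500 Hz after rotation.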
D. Additional manipulations probing the role
of speech features unaffected by spectral rotation
In order to assess the contribution of intonation to the
comprehension of spectrally-rotated speech, sentence tests
were included in which, prior to spectral rotation, stimuli
were manipulated so as to eliminate intonation information.
The Pitch Synchronous Overlap Add technique (Moulines
and Charpentier, 1990), as implemented in Praat (Boersma
and Weenink, 2001) was used to replace the natural pitch
contours of the BKB and ASL sentences with a monotone at
230 Hz for the female talker and at 150 Hz for the male
talker. As a control for possible artifacts introduced by Praat,
a further condition was included in which the natural pitch
contours were shifted, up by 3.5 semitones for the male
talker and down by 3.5 semitones for the female talker.
These conditions will subsequently be referred to respec-
tively as “Monotone” and “Shifted.”
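A 3.5-semitone shift corresponds to scaling F0 by 2^(3.5/12) ≈ 1.22. Using the study's monotone F0 values purely as illustrative starting points (the Shifted condition actually shifted the natural contours), a 150-Hz voice moves up to roughly 184 Hz, and a 230-Hz voice down to roughly 188 Hz:

```python
SEMITONES = 3.5
ratio = 2 ** (SEMITONES / 12)      # frequency ratio per 3.5 semitones, ~1.22

male_f0, female_f0 = 150.0, 230.0  # monotone F0 values used in the study, Hz
shifted_male = male_f0 * ratio     # male talker shifted up
shifted_female = female_f0 / ratio # female talker shifted down
```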
An alternative approach to assessing the relative contri-
butions of spectral dynamics and features unaffected by spec-
tral rotation involved eliminating spectral variation while
preserving amplitude envelope and pitch and periodicity in-
formation. This manipulation, previously used by Faulkner
and Rosen (1999), was implemented in real time in Aladdin
by the use of a second input signal comprising pulses occur-
ring once per pitch period. This pulse input triggered the gen-
eration of a pulse train carrier within the DSP system which
was then modulated by an amplitude envelope extracted from
the speech signal (bandpass filtered between 50 Hz and
3 kHz, 6-dB per octave roll-off). Envelope extraction
employed full-wave rectification and a 32-Hz low-pass filter
(fourth-order elliptic). During unvoiced speech segments a
white-noise carrier was used. Mixed excitation sounds (e.g.,
/z/) led to a voiced output alone. Finally, spectral rotation
was applied as described above. This condition will subse-
quently be referred to as “PP” (for “Pitch and Periodicity”).
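A minimal sketch of this envelope-and-carrier manipulation is given below. It is an offline approximation, not the Aladdin implementation: a one-pole smoother stands in for the fourth-order elliptic 32-Hz filter, the 50-Hz to 3-kHz bandpass pre-filter is omitted, and a fixed pulse period replaces the pitch-synchronous pulse input:

```python
import math
import random

FS = 22050  # sampling rate, Hz

def envelope(x, cutoff=32.0, fs=FS):
    """Full-wave rectification followed by smoothing.

    A one-pole low-pass stands in for the paper's fourth-order elliptic
    32-Hz filter (an assumption of this sketch)."""
    a = math.exp(-2 * math.pi * cutoff / fs)
    out, state = [], 0.0
    for v in x:
        state = a * state + (1 - a) * abs(v)  # rectify, then smooth
        out.append(state)
    return out

def pp_signal(x, f0=150.0, voiced=True, fs=FS):
    """Replace all spectral detail with an envelope-modulated carrier:
    a once-per-period pulse train when voiced, white noise when unvoiced."""
    env = envelope(x)
    if voiced:
        period = max(1, int(fs / f0))
        carrier = [1.0 if i % period == 0 else 0.0 for i in range(len(x))]
    else:
        carrier = [random.uniform(-1.0, 1.0) for _ in range(len(x))]
    return [e * c for e, c in zip(env, carrier)]
```

The output thus preserves amplitude envelope, periodicity, and (via the pulse rate) pitch, while carrying no spectral shape information.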
A further condition, tested only in two of the trained par-
ticipants (see Sec. II F below), assessed the possibility that
improvements in sentence recognition with training primar-
ily reflected an enhanced ability to take advantage of infor-
mation from the narrow band of frequencies centered on the
frequency around which the spectrum was rotated, which are
relatively unaffected by the transformation. Prior to spectral
rotation, sentences were subject to filtering with steep cutoffs
(Chebyshev type II) to restrict the signal to a single band
centered at 2 kHz. The bandwidth was 240 Hz, correspond-
ing to the equivalent rectangular bandwidth of the auditory
filter (Moore and Glasberg, 1983).
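The 240-Hz figure follows from the Moore and Glasberg (1983) ERB polynomial, ERB = 6.23f² + 93.39f + 28.52 Hz with f in kHz, evaluated at the 2-kHz rotation centre:

```python
def erb_hz(f_khz):
    """Moore & Glasberg (1983) equivalent rectangular bandwidth, in Hz."""
    return 6.23 * f_khz ** 2 + 93.39 * f_khz + 28.52

# at 2 kHz this gives ~240 Hz, the bandwidth of the single analysis band
bandwidth = erb_hz(2.0)
```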
E. Training
CDT was implemented in a similar way to Rosen et al.
(1999). A single female speaker of Southern British English
(one of the authors, R. Paterson) was the talker for all four
trained participants. The materials used were books drawn
from the Heinemann Guided Readers series aimed at learn-
ers of English. The trainer and the participant were seated in
adjacent sound-proof rooms separated by a double-pane
glass partition. The trainer read phrases from the text which
the participant was required to repeat verbatim. Following
an accurate response, the trainer moved on to the next
phrase. If an error was made, the trainer repeated all or part
of the phrase, and the participant responded again. If the
phrase had not been accurately repeated after three attempts,
it was presented unprocessed before moving on to the next
phrase. Performance on the task was assessed in terms of the
rate, in words/min, at which the participant was able to cor-
rectly repeat the phrases spoken by the trainer. A low-level
pink noise was present in the participant’s room to mask any
of the trainer’s natural speech not sufficiently attenuated by
the intervening wall. Approximately half the training was
carried out with the participant able to see the speaker’s
face, while for the remainder, the glass partition between the
two rooms was covered and the participant received only
audio input. The addition of visual input was intended to
expedite learning, particularly in the early stages of training
when performance with the audio signal alone was expected
to be very poor.
F. Procedure
Trained participants completed four sessions of speech
perception testing. The first three sessions took place at
approximately weekly intervals. There was a shorter gap,
typically 3 to 4 days, between the third and fourth sessions.
They underwent a total of 6 h of CDT with spectrally-rotated
speech: 3 h between the first and second testing sessions, and
3 h between the second and third testing sessions. No train-
ing took place between the third and fourth testing sessions.
Training was completed in eight 45-min sessions, divided
into 5-min blocks which alternated between audiovisual and
audio-only presentation, with the exception that the first four
blocks in the initial training session all used audiovisual pre-
sentation. Untrained participants completed the same tests as
the trained group, but over a shorter period of time, typically
within a few days.
The first three testing sessions contained tests of the per-
ception of spectrally-rotated speech, using all the different
speech materials. In each session the 120 VCV stimuli were
presented once and the 68 vowel stimuli twice. Four BKB
sentence lists (female talker) were presented, two audiovisu-
ally and two audio-only. Two ASL sentence lists (male
talker) were presented audio-only. In addition, in the first
testing session vowel and consonant perception were
assessed with stimuli low-pass filtered at 4 kHz, but other-
wise unprocessed. No feedback was given during any of the
testing. The order in which the four different types of tests
were conducted by each participant was based on random-
ized Latin Squares. Sentence lists were chosen at random
(without replacement) for each participant.
The fourth testing session examined speech recognition
in the conditions with additional manipulations intended to
elucidate the factors underlying learning of spectrally-
rotated speech. Sentence recognition was assessed audio-
only in Shifted, Monotone, and PP conditions (one BKB and
two ASL lists in each condition).
In addition, two of the four trained participants (T1 and
T4) carried out additional testing approximately 7 weeks af-
ter the fourth testing session. Sentence recognition was
tested both for spectrally-rotated speech as previously expe-
rienced, to assess retention of learning, and for speech fil-
tered into a single spectral band around 2 kHz. Prior to
testing they were given a brief “reminder” CDT session
comprising one 5-min block of audiovisual training followed
by four 5-min blocks of audio-only training.
III. RESULTS
A. Connected discourse tracking
Figure 1 shows how performance in CDT training
changed over the eight 45-min training sessions. As would
be expected, performance was better in the audiovisual con-
dition than with audio-only presentation, reflecting the avail-
ability of speech-reading cues. In the audiovisual condition,
performance reached a plateau around the fourth training
session at approximately 80 words/min. This is some way
short of the maximum rate previously found with normal
speech and full visual and acoustic cues of between 110 and
130 words/min, depending on the complexity of the training
test (De Filippo and Scott, 1978; Hochberg et al., 1989).
This suggests that although tracking rates ceased to improve
over the final few training sessions, adaptation was not com-
plete. In the audio-only condition performance continued to
improve until the final one or two sessions. The lack of
improvement in the final session may partly reflect the fact
that, during this session, three of the four participants
reached the end of the book that had been used since the start
of training and therefore had to adjust to new material.
Linear regressions on words correct per minute against
training session in audio-only conditions showed that there
was a highly significant increase in performance across
session (p < 0.001 for all four participants). For three of the
four participants the regression slopes were similar, corre-
sponding to an increase of between 15 and 20 words/min for
each hour of training. The remaining participant (T3) learned
considerably more slowly with a regression slope showing
an increase of only 4 words/min for every hour of training.
B. Consonants
Consonant identification data is shown in Fig. 2.
Performance was near ceiling with stimuli which were
unprocessed beyond low-pass filtering at 4 kHz. For
spectrally-rotated stimuli, performance was very low in all
three testing sessions for both groups with both talkers, with
no evidence of learning for the trained group. Here, and sub-
sequently, data were analyzed using mixed effects linear
modeling. This enables the amount of training to be treated
as a continuous variable and also allows incorporation of
scores from multiple lists in the same condition. Data from
the spectrally-rotated stimuli were analyzed with factors of
talker (male or female), group (trained or control), and test
session number. No significant main effects or interactions
were observed {main effect of talker [F(1,40) = 1.64, p = 0.208], interaction between talker and session number [F(1,40) = 1.17, p = 0.285], all other F's < 1}.
C. Vowels
As shown in Fig. 3, vowel identification data showed a
similar pattern to consonant identification: Near-ceiling per-
formance on stimuli that were merely low-pass filtered at
4 kHz, very low performance in all tests with spectrally-
rotated stimuli, and no evidence of learning. An analysis
using a mixed effects linear model with factors of group,
talker, and session number showed no significant main
effects or interactions {main effect of group [F(1,88) = 1.07, p = 0.303], all other F's < 1}.
D. Sentences
1. Spectral rotation only
Recognition of spectrally-rotated sentences with audio-
only presentation over the first three test sessions will be
examined first (Fig. 4). In contrast to the vowel and conso-
nant data, there was clear evidence of learning. In the first
testing session performance was very low for both trained
and untrained groups. In subsequent testing sessions, how-
ever, while performance remained poor for the untrained
group, it steadily increased for the trained group. Increased
sentence recognition with training was apparent with both
test talkers, but was particularly pronounced for the female
talker.
All four trained participants showed improvements over
the different test sessions although there was considerable
variability in the extent of improvement. Consistent with the
CDT data the least improvement was shown by T3, for
whom the mean proportion of key words correct for the female talker increased from 0.01 to 0.11 over the three test sessions. For the other three participants the increase in proportion correct for the female talker ranged between 0.26 and 0.64. Benefits from training were somewhat more consistent for the male talker, with increases in proportion correct for all four listeners ranging between 0.09 and 0.22.
FIG. 1. Boxplots of performance over the eight training sessions of CDT with spectrally-rotated speech. Mean words correct per minute, averaged across the 5-min segments within each training session, are shown for both audiovisual and audio-only presentation.
FIG. 2. Boxplots of consonant identification for trained and control groups for the male and female talkers (top and bottom panels, respectively). The two leftmost boxes show performance with low-pass filtered (LP) but otherwise unprocessed materials obtained in the first testing session. The remaining boxes show performance with spectrally-rotated materials obtained over the first three testing sessions.
Audio-only data from the first three test sessions were
analyzed with a mixed effects linear model with factors of
talker, session, and group. The three-way interaction was
close to significance [F(1,87) = 3.17, p = 0.078], as was the two-way interaction between talker and session [F(1,87) = 3.42, p = 0.068]. The two-way interaction between talker
and group was not significant [F < 1]. Most importantly, there
was a highly significant two-way interaction between session
and group [F(1,87) = 19.67, p < 0.001], reflecting the fact
that performance did not change for the untrained group, but
increased over time for the trained group. The main effect of
talker was not significant [F(1,87) = 2.69, p = 0.105]. The
significant interaction between group and session means that
analysis of their main effects is of little consequence, but for
completeness we observe that there was no significant effect
of group [F(1,87) = 2.92, p = 0.091], but a highly significant effect of session [F(1,87) = 23.93, p < 0.001].
With audiovisual presentation (female talker only) there
was also evidence of learning, although there was very large
variability in performance levels before training and ceiling
effects occurred for the trained group. Despite these limita-
tions, a mixed effects linear analysis showed evidence of
learning in the form of a significant interaction between ses-
sion and group [F(1,44) = 5.65, p = 0.022]. Data from audiovisual conditions will not be considered further.
2. Spectral rotation with additional signal
manipulations
As shown in Fig. 4, performance was also better for the
trained than the untrained group in the Monotone and
Shifted conditions. For the male talker there was little differ-
ence between performance in these conditions and that in the
third test session with spectrally-rotated speech, conducted
after 6 h of training. For the female talker the Monotone con-
dition did produce poorer performance for the trained group
than that in the third test session with spectrally-rotated
speech. However, for three of the four participants a very
similar decrement in performance was also apparent in the
Shifted condition. Since the intonation information con-
tained in the Shifted condition is very similar to that in the
speech subjected only to spectral rotation, it is likely that the
decrement in the Monotone condition was primarily attribut-
able to artifacts of the voice pitch manipulation process,
rather than to the absence of intonation information.
Data from the Monotone and Shifted conditions and
from the third test session with spectrally-rotated speech
were submitted to a mixed effects linear analysis with
factors of group, talker, and condition. There was a highly
significant effect of group [F(1,68) = 37.09, p < 0.001],
but no other main effects or interactions were significant
{three-way interaction [F(1,68) = 1.56, p = 0.218], interaction
between group and talker [F(1,68) = 2.40, p = 0.126],
interaction between group and condition [F(1,68) = 1.17,
p = 0.318], all other F's < 1}.

FIG. 3. Boxplots of vowel identification for trained and control groups for
the male and female talkers (top and bottom panels, respectively). The two
leftmost boxes show performance with low-pass filtered but otherwise
unprocessed materials obtained in the first testing session (LP). The remaining
boxes show performance with spectrally-rotated materials obtained over
the first three testing sessions.

FIG. 4. Boxplots of audio-only sentence recognition for the male and female
talkers. Each panel shows (from left to right) performance with spectrally-rotated
sentences in the first three testing sessions, and with pitch-shifted
and monotone sentences from the fourth testing session.

1374 J. Acoust. Soc. Am., Vol. 134, No. 2, August 2013 Green et al.: Adaptation to spectrally-rotated speech
The elimination of spectral variation implemented in the
PP condition led to floor effects for both trained and
untrained groups. Across the two groups, 21 out of a total of
24 runs produced zero key words correct.
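The floor-level performance in the PP condition is unsurprising given how little information survives once spectral variation is removed. As a rough illustration only (not the processing actually used in this study, which additionally preserved pitch and periodicity cues), a signal can be reduced to its broadband amplitude envelope imposed on a spectrally flat carrier:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_carrier(x, fs, cutoff=30.0, rng=None):
    """Replace a signal with its amplitude envelope on a noise carrier.

    Simplified sketch of eliminating spectral variation: only the
    broadband envelope survives (the study's PP condition additionally
    preserved pitch and periodicity, which this sketch omits).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    env = np.abs(hilbert(x))                          # instantaneous envelope
    sos = butter(4, cutoff / (fs / 2.0), btype="low", output="sos")
    env = np.maximum(sosfiltfilt(sos, env), 0.0)      # smooth, clip negatives
    carrier = rng.standard_normal(len(x))             # spectrally flat carrier
    return env * carrier

# Demonstration: a tone followed by silence keeps only its on/off envelope.
fs = 16000
t = np.arange(fs // 2) / fs
x = np.concatenate([np.sin(2.0 * np.pi * 500.0 * t), np.zeros(fs // 2)])
y = envelope_carrier(x, fs)
```

With all spectral detail replaced by noise, only gross temporal structure remains, consistent with the near-zero keyword scores observed.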
Figure 5 shows audio-only sentence recognition for the
two trained participants who were available for further testing
approximately 7 weeks after completing the main experiment.
Their performance in the third testing session is shown
for comparison, and it is clear that there was no decrement in
performance over the intervening period. Figure 5 also shows
that performance was at floor when signals were restricted to
a single spectral band centered at 2 kHz.
IV. DISCUSSION
Audio-only sentence recognition improved for the
trained group but not for the untrained group. This confirms
that adaptation to spectrally-rotated speech is possible and
shows that the improvement was not attributable to repeated
exposure to the test procedures. Improved sentence recogni-
tion after training was observed for two different talkers,
both of whom differed from the training talker, indicating
generalization of learning across talkers. Recognition of
spectrally-rotated sentences in two subjects available for
follow-up testing did not differ from that obtained immedi-
ately after training, suggesting that learning was robust over
a period of several weeks. Data obtained with additional
stimulus manipulations, discussed in more detail below,
were consistent with the idea that improvements with
training did not simply reflect better use of those, primarily
temporal, speech features that are relatively well preserved
after spectral rotation. These findings demonstrate that quite
rapid adaptation is possible to a radical transformation of the
representation of critical speech spectral information.
Spectral rotation is clearly not directly comparable to the
transformations that might be experienced by users of audi-
tory prostheses. However, the present results do suggest that non-monotonic
transformations of spectral information, such as
might be experienced particularly by ABI users, do not, per
se, preclude the regaining of substantial levels of speech
recognition.
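For readers unfamiliar with the transformation, a standard way to implement spectral rotation about a center frequency is ring modulation of low-pass filtered speech by a carrier at twice that frequency, which mirrors each component at f Hz to (2·fc − f) Hz. The sketch below illustrates the principle only; the exact filter designs used in the present study may differ.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def spectrally_rotate(x, fs, center=2000.0):
    """Mirror the 0..2*center band of `x` about `center` Hz.

    Multiplying by a cosine at 2*center maps a component at f Hz
    to 2*center - f Hz; low-pass filtering before and after removes
    energy outside the rotated band.
    """
    band = 2.0 * center
    sos = butter(8, band / (fs / 2.0), btype="low", output="sos")
    x_lp = sosfiltfilt(sos, x)                          # keep 0..4 kHz only
    t = np.arange(len(x)) / fs
    mirrored = x_lp * np.cos(2.0 * np.pi * band * t)    # images at band -/+ f
    return 2.0 * sosfiltfilt(sos, mirrored)             # drop the upper image

# A 1 kHz tone should emerge near 3 kHz after rotation about 2 kHz.
fs = 16000
t = np.arange(fs) / fs
rotated = spectrally_rotate(np.sin(2.0 * np.pi * 1000.0 * t), fs)
peak_hz = np.argmax(np.abs(np.fft.rfft(rotated))) * fs / len(rotated)
```

Because the mapping is a mirror rather than a shift, spectral shape and formant trajectories are inverted while broad temporal properties survive, which is exactly what makes the transformation so disruptive initially.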
It was conceivable that improvements after training
might involve participants learning to make better use of
speech information that was relatively unaffected by the
transformation, such as periodicity and envelope, while
ignoring distorted spectral information. However, this was
not supported by testing with stimulus manipulations that,
prior to spectral rotation, selectively eliminated or preserved
different aspects of speech information. Sentence recogni-
tion for the two re-tested participants was at floor for stimuli
restricted to the central frequency band around 2 kHz that
was largely unaffected by spectral rotation, showing that
they had not learned simply to ignore information from
transformed spectral regions. Performance was similarly
poor for stimuli in which spectral variation was eliminated,
while pitch, periodicity, and amplitude envelope were pre-
served. This suggests that post-training improvements in per-
ception of spectrally-rotated speech did reflect adaptation to
altered spectral information and is consistent with the notion
that access to spectral dynamics is critical to speech under-
standing (Rosen and Iverson, 2007). In addition, sentence
recognition was not significantly poorer when natural F0
contours were replaced with monotones. Therefore, the pres-
ervation of intonation contour shape does not appear to be
critical to the comprehension of spectrally-rotated speech af-
ter training. It remains possible, however, that the presence
of near natural intonation patterns may be important during
the learning process through, for example, providing helpful
cues to segmentation and syntactic structure.
Somewhat surprisingly, despite the improvements in
CDT rates and sentence recognition with training, there were
no significant improvements in medial vowel or intervocalic
consonant identification, suggesting a particularly important
role for contextual information (Boothroyd and Nittrouer,
1988). One possibility is that the multiple constraints on
word choices provided by simple, predictable sentences
make these materials sensitive to improvements in the per-
ception of speech features that are too small to be manifested
in tests of isolated phoneme identification. However, it
should be noted that this aspect of our data contrasts with
Blesser (1969, 1972), where there were improvements in
both vowel and consonant recognition. Methodological fac-
tors might be important here. For example, Blesser’s partici-
pants completed tests of vowel and consonant discrimination
prior to identification tests. It should, however, also be noted
that the improvements in vowel or consonant identification
observed by Blesser were quite small and that there was little
correlation with improvements in sentence recognition.
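The intuition that contextual constraints can amplify small perceptual gains can be made concrete with Boothroyd and Nittrouer's (1988) k-factor model, in which context raises a recognition probability p to 1 − (1 − p)^k, with k > 1. The values below are purely illustrative, not fitted to the present data:

```python
def with_context(p, k):
    """Boothroyd & Nittrouer (1988): context raises p to 1 - (1 - p)**k."""
    return 1.0 - (1.0 - p) ** k

# Illustrative values only: a modest k and a small phoneme-level gain.
k = 3.0
p_pre, p_post = 0.30, 0.34
gain_phoneme = p_post - p_pre                               # 0.04
gain_context = with_context(p_post, k) - with_context(p_pre, k)
# The same underlying improvement yields a larger gain when context
# multiplies the number of independent routes to recognition.
```

Under such a model, a change too small to reach significance in isolated phoneme identification could still produce a measurable improvement in sentence keyword scores.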
FIG. 5. Boxplots of audio-only sentence recognition for the male and female
talkers from the additional testing carried out with two trained participants,
7 weeks after the end of the initial testing. This included simple spectral
rotation and also conditions in which stimuli were restricted to a single fre-
quency band centered at 2 kHz. For comparison the performance of these
two participants with spectrally-rotated sentences in the third testing session
is shown on the left-hand side of the figure.
Studies of the learning of spectrally-shifted noise- or
tone-vocoded speech have also produced an inconsistent pic-
ture regarding the relationship between sentence recognition
and phoneme identification, measured using VCV and bVd
or hVd materials. Using CDT training Rosen et al. (1999)
found large improvements in recognition of BKB sentences
and also significant improvements in vowel and consonant
identification. Fu et al. (2005) implemented an interactive,
computer-based training method using unconnected HINT
sentences (Nilsson et al., 1994) and found significant
improvements for consonant, but not vowel, identification.
Unfortunately, Fu et al. (2005) did not include sentence tests,
but large increases in sentence recognition during training
were reported. Using a similar training method Stacey and
Summerfield (2008) found significant improvements in rec-
ognition of BKB and IEEE sentences, but no significant
effect of training on either vowel or consonant identification.
Sentence recognition, of course, involves a wider range
of cognitive skills operating on lexical, semantic, and syntac-
tic information, which introduces more variability across
participants. It may also be relevant that the two aforemen-
tioned studies that showed significant improvement for both
vowel and consonant identification (Blesser, 1972; Rosen
et al., 1999) used a single talker for each type of test material
whereas the remainder, including the present study, used
multiple talkers within vowel and consonant tests. It may be
that trial-to-trial variation of talkers depresses performance
in tests of brief isolated phonemes which provide very little
time to adjust to a different talker.
The finding that learning generalized across talkers
indicates that adaptation was occurring at a level of abstrac-
tion beyond the particular acoustic-phonetic patterns pro-
duced by the training talker. Generalization to new talkers
would likely be increased by using a number of different
talkers during training (Stacey and Summerfield, 2007).
Further research might explore which aspects of speech
processing are being modified during learning of spectrally-
rotated speech. In the context of vocoded or synthetic
speech, there has been considerable research examining the
transfer of perceptual learning of speech to contexts that
were not experienced in training, providing information
about the levels of processing at which training related
changes occur (e.g., Francis et al., 2007; Dahan and Mead,
2010; Hervais-Adelman et al., 2011). Such work suggests
that while there is more or less complete generalization
across some acoustic features, there are also context-
dependent aspects of learning. For example, Hervais-
Adelman et al. (2011) showed that learning of vocoded
speech transferred to an untrained frequency region, but that
learning only partly generalized across different carrier sig-
nals used in vocoding. Dahan and Mead (2010) found that
after training with noise-vocoded monosyllables, perception
of consonants in test stimuli differed according to whether
they appeared in the same position, or flanked by the same
vowel, as in the training stimuli. There is also evidence that
lexical information plays an important role in learning of
vocoded speech (Stacey and Summerfield, 2008), but that
semantic context is not essential for learning (Davis et al.,
2005; Loebach et al., 2010).
It would appear reasonable to expect a considerable
overlap between the processes involved in adapting to
spectrally-shifted vocoded speech and spectrally-rotated
speech. However, the transformations do differ substantially,
e.g., in the contrasting extent to which information about
intonation and spectral dynamics is preserved, and it remains
possible that there may be aspects of learning that are spe-
cific to spectral rotation. Other unresolved issues include the
extent to which further improvements in speech perception
might be possible with long-term experience with extreme
spectral transformations, and the degree to which short-term
adaptation might occur with passive exposure to spectral
rotation rather than specific training. However, it does
appear that some caution may be necessary in the treatment
of spectrally-rotated speech as a non-speech control in neu-
roimaging research, in particular when relatively long expo-
sure periods are used.
V. CONCLUSIONS
Considerable adaptation to an extreme form of distortion
of spectral information was possible with a few hours'
experience. Learning did appear to involve adaptation to
altered spectral shape and dynamics. The fact that intona-
tional contrasts are well preserved after the transformation
did not appear to be important for the comprehension of
spectrally-rotated speech after training, though it remains
possible that intonation might contribute to the process of
adaptation.
ACKNOWLEDGMENT
This work was partially supported by Action on Hearing
Loss (Grant No. G53).
Azadpour, M., and Balaban, E. (2008). “Phonological representations are
unconsciously used when processing complex, non-speech signals,” PLoS
ONE 3, e1966.
Bench, J., Kowal, A., and Bamford, J. (1979). “The BKB (Bamford-Kowal-
Bench) sentence lists for partially-hearing children,” Br. J. Audiol. 13,
108–112.
Blesser, B. (1969). “Perception of spectrally rotated speech,” Ph.D. disserta-
tion, Massachusetts Institute of Technology, Cambridge, MA.
Blesser, B. (1972). “Speech perception under conditions of spectral transfor-
mation. 1. Phonetic characteristics,” J. Speech Hear. Res. 15, 5–41.
Boersma, P., and Weenink, D. (2001). “Praat: Doing phonetics by computer
(version 3.9.28) [computer program],” http://www.praat.org (Last viewed
May 2001).
Boothroyd, A., and Nittrouer, S. (1988). “Mathematical treatment of context
effects in phoneme and word recognition,” J. Acoust. Soc. Am. 84,
101–114.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R.,
Hagerman, B., Hetu, R., Kei, J., Lui, C., Kiessling, J., Kotby, M. N.,
Nasser, N. H. A., El Kholy, W. A. H., Nakanishi, Y., Oyer, H., Powell, R.,
Stephens, D., Meredith, R., Sirimanna, T., Tavartkiladze, G., Frolenkov,
G. I., Westerman, S., and Ludvigsen, C. (1994). “An international compar-
ison of long-term average speech spectra,” J. Acoust. Soc. Am. 96,
2108–2120.
Colletti, L., Shannon, R., and Colletti, V. (2012). “Auditory brainstem
implants for neurofibromatosis type 2,” Curr. Opin. Otolaryngol. Head
Neck Surg. 20, 353–357.
Dahan, D., and Mead, R. L. (2010). “Context-conditioned generalization in
adaptation to distorted speech,” J. Exp. Psychol. Hum. Percept. Perform.
36, 704–728.
Davis, M. H., Johnsrude, I. S., Hervais-Adelman, A., Taylor, K., and
McGettigan, C. (2005). “Lexical information drives perceptual learning of
distorted speech: Evidence from the comprehension of noise-vocoded
sentences,” J. Exp. Psychol. Gen. 134, 222–241.
De Filippo, C., and Scott, B. L. (1978). “Method for training and evaluating
reception of ongoing speech,” J. Acoust. Soc. Am. 63, 1186–1192.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997a). “Speech intelligibil-
ity as a function of the number of channels of stimulation for signal pro-
cessors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am.
102, 2403–2411.
Dorman, M. F., Loizou, P. C., and Rainey, D. (1997b). “Simulating the
effect of cochlear-implant electrode insertion depth on speech under-
standing,” J. Acoust. Soc. Am. 102, 2993–2996.
Dupoux, E., and Green, K. (1997). “Perceptual adjustment to highly com-
pressed speech: Effects of talker and rate changes,” J. Exp. Psychol. Hum.
Percept. Perform. 23, 914–927.
Faulkner, A., and Rosen, S. (1999). “Contributions of temporal encodings of
voicing, voicelessness, fundamental frequency, and amplitude variation to
audio-visual and auditory speech perception,” J. Acoust. Soc. Am. 106,
2063–2073.
Faulkner, A., Rosen, S., and Norman, C. (2006). “The right information may
matter more than frequency-place alignment: Simulations of frequency-
aligned and upward shifting cochlear implant processors for a shallow
electrode array insertion,” Ear Hear. 27, 139–152.
Finley, C. C., Holden, T. A., Holden, L. K., Whiting, B. R., Chole, R. A.,
Neely, G. J., Hullar, T. E., and Skinner, M. W. (2008). “Role of electrode
placement as a contributor to variability in cochlear implant outcomes,”
Otol. Neurotol. 29, 920–928.
Francis, A. L., Nusbaum, H. C., and Fenn, K. (2007). “Effects of training on
the acoustic-phonetic representation of synthetic speech,” J. Speech Lang.
Hear. Res. 50, 1445–1465.
Fu, Q.-J., and Galvin, J. J. (2003). “The effects of short-term training for
spectrally mismatched noise-band speech,” J. Acoust. Soc. Am. 113,
1065–1072.
Fu, Q.-J., Nogaki, G., and Galvin, J. J. (2005). “Auditory training with
spectrally-shifted speech: Implications for cochlear implant patient audi-
tory rehabilitation,” J. Assoc. Res. Otolaryngol. 6, 180–189.
Green, T., Faulkner, A., and Rosen, S. (2002). “Spectral and temporal cues
to pitch in noise-excited vocoder simulations of continuous-interleaved-
sampling cochlear implants,” J. Acoust. Soc. Am. 112, 2155–2164.
Hervais-Adelman, A., Davis, M. H., Johnsrude, I. S., and Carlyon, R. P.
(2008). “Perceptual learning of noise vocoded words: Effects of feed-
back and lexicality,” J. Exp. Psychol. Hum. Percept. Perform. 34,
460–474.
Hervais-Adelman, A. G., Davis, M. H., Johnsrude, I. S., Taylor, K. J., and
Carlyon, R. P. (2011). “Generalization of perceptual learning of vocoded
speech,” J. Exp. Psychol. Hum. Percept. Perform. 37, 283–295.
Hill, J., McRae, P., and McClellan, R. (1968). “Speech recognition as a
function of channel capacity in a discrete set of channels,” J. Acoust. Soc.
Am. 44, 13–18.
Hochberg, I., Rosen, S., and Ball, V. (1989). “Effect of text complexity on
connected discourse tracking rate,” Ear Hear. 10, 192–199.
Ketten, D. R., Vannier, M. W., Skinner, M. W., Gates, G. A., Wang, G., and
Neely, J. G. (1998). “In vivo measures of cochlear length and insertion
depth of nucleus cochlear implant electrode arrays,” Ann. Otol. Rhin.
Laryngol. 107, 1–16.
Loebach, J. L., Pisoni, D. B., and Svirsky, M. A. (2010). “Effects of seman-
tic context and feedback on perceptual learning of speech processed
through an acoustic simulation of a cochlear implant,” J. Exp. Psychol.
Hum. Percept. Perform. 36, 224–234.
MacLeod, A., and Summerfield, A. Q. (1990). “A procedure for measuring
auditory and audio-visual speech-reception thresholds for sentences in
noise: Rationale, evaluation, and recommendations for use,” Br. J. Audiol.
24, 29–43.
Moore, B. C. J., and Glasberg, B. R. (1983). “Suggested formulae for calcu-
lating auditory filter bandwidths and excitation patterns,” J. Acoust. Soc.
Am. 74, 750–753.
Moulines, E., and Charpentier, F. (1990). “Pitch-synchronous waveform
processing techniques for text-to-speech synthesis using diphones,”
Speech Commun. 9, 453–467.
Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the
hearing in noise test for the measurement of speech reception thresholds in
quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099.
Pallier, C., Sebastián-Gallés, N., Dupoux, E., Christophe, A., and Mehler, J.
(1998). "Perceptual adjustment to time-compressed speech: A cross-
linguistic study," Mem. Cognit. 26, 844–851.
Plomp, R. (1967). “Pitch of complex tones,” J. Acoust. Soc. Am. 41,
1526–1533.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981). “Speech
perception without traditional speech cues,” Science 212, 947–950.
Rosen, S., Faulkner, A., and Wilkinson, L. (1999). “Adaptation by normal
listeners to upward spectral shifts of speech: Implications for cochlear
implants,” J. Acoust. Soc. Am. 106, 3629–3636.
Rosen, S., and Iverson, P. (2007). “Constructing adequate non-speech ana-
logues: what is special about speech anyway?” Dev. Sci. 10, 165–168.
Scott, S. K., Blank, C. C., Rosen, S., and Wise, R. J. S. (2000).
“Identification of a pathway for intelligible speech in the left temporal
lobe,” Brain 123, 2400–2406.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M.
(1995). “Speech recognition with primarily temporal cues,” Science 270,
303–304.
Shannon, R. V., Zeng, F.-G., and Wygonski, J. (1998). "Speech recognition
with altered spectral distribution of envelope cues," J. Acoust. Soc. Am.
104, 2467–2476.
Smith, M. W., and Faulkner, A. (2006). “Perceptual adaptation by normally
hearing listeners to a simulated ‘hole’ in hearing,” J. Acoust. Soc. Am.
120, 4019–4030.
Souza, P., and Rosen, S. (2009). “Effects of envelope bandwidth on the
intelligibility of sine- and noise-vocoded speech,” J. Acoust. Soc. Am.
126, 792–805.
Stacey, P. C., and Summerfield, A. Q. (2007). “Effectiveness of computer-
based auditory training in improving the perception of noise-vocoded
speech,” J. Acoust. Soc. Am. 121, 2923–2935.
Stacey, P. C., and Summerfield, A. Q. (2008). “Comparison of word-, sen-
tence-, and phoneme-based training strategies in improving the perception
of spectrally distorted speech,” J. Speech Lang. Hear. Res. 51, 526–538.