The advantage of knowing where to listen a)
Gerald Kidd, Jr., b) Tanya L. Arbogast, Christine R. Mason, and Frederick J. Gallun
Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University,
635 Commonwealth Avenue, Boston, Massachusetts 02215
(Received 31 May 2005; revised 9 September 2005; accepted 12 September 2005)
This study examined the role of focused attention along the spatial (azimuthal) dimension in a
highly uncertain multitalker listening situation. The task of the listener was to identify key words
from a target talker in the presence of two other talkers simultaneously uttering similar sentences.
When the listener had no a priori knowledge about target location, or which of the three sentences
was the target sentence, performance was relatively poor, near the value expected simply from
choosing to focus attention on only one of the three locations. When the target sentence was cued
before the trial, but location was uncertain, performance improved significantly relative to the
uncued case. When spatial location information was provided before the trial, performance
improved significantly for both cued and uncued conditions. If the location of the target was certain,
proportion correct identification performance was higher than 0.9, independent of whether the target
was cued beforehand. In contrast to studies in which known versus unknown spatial locations were
compared for relatively simple stimuli and tasks, the results of the current experiments suggest that
the focus of attention along the spatial dimension can play a very significant role in solving the
"cocktail party" problem. © 2005 Acoustical Society of America. [DOI: 10.1121/1.2109187]
PACS number(s): 43.66.Lj, 43.66.Dc, 43.66.Pn [AJO] Pages: 3804–3815
I. INTRODUCTION
There are many factors that can interfere with a listener
attempting to comprehend the speech of one particular talker
in the presence of competing talkers. The speech of the target
talker can be masked by other sounds, obscuring portions of
the message, leaving it so incomplete as to be meaningless or
misunderstood. The target speech stream can be embedded in
other speech streams to the point where the listener is unable
to segregate it from the others and cannot connect the ele-
ments of the target message that belong together. The listener
can be uncertain or confused about which talker to attend to
and thereby direct attention to the wrong source of speech.
There are other acoustic and perceptual factors that come
into play as well. The target speech, competing speech, and
other sounds usually reflect off of the various surfaces of the
sound field, creating echoes that arrive at the ears delayed in
time and from various directions and that interact with the
direct sources of sound. Also, there is the normal, fundamen-
tal use of the sense of hearing to continually monitor the
auditory scene for important changes and rapidly evaluate
them when they occur, potentially interrupting and diverting
processing resources away from the target. Despite the
daunting complexity of this task, humans are normally quite
successful at selecting and comprehending the speech of one
talker among many talkers or other distracting and compet-
ing sources of sound. However, this complexity—comprised
of acoustic, perceptual, and cognitive factors—makes it ex-
tremely difficult to completely describe the processes in-
volved and how they interact. Despite the fact that this capability has been studied extensively since Cherry (1953)
published his famous article describing the "cocktail party
problem" (for recent reviews see Yost, 1997; Bronkhorst,
2000; and Ebata, 2003), a number of important questions
remain.
Among the questions about the cocktail party problem
for which we are lacking a satisfactory answer is that of the
importance of the ability to focus attention at a point along
the spatial dimension.1 Clearly, attention must be focused on
the target source of speech if it is to be fully understood, but
there are many ways to segregate the target speech stream
from other sounds and the importance of the focus of atten-
tion along the spatial dimension, per se, is not well under-
stood. Scharf 共1998兲, for example, in his review of attention
in the auditory modality, notes that most of the evidence
regarding the role of spatial focus of attention does not indi-
cate large effects. Generally, cuing uncertain locations results
in relatively small improvements in response time and, in
some cases, in accuracy (e.g., Spence and Driver, 1994;
Mondor and Zatorre, 1995; Mondor et al., 1998; Sach et al.,
2000; Woods et al., 2001). However, very little of this work
has used speech as the stimulus. The question addressed in
the current study is: if the speech of the target talker is not
appreciably masked by competing sounds, and the sounds
and their sources are easily segregated into distinct auditory
objects, what is the benefit of directing attention toward the
target source?
Determining the role of focused attention in the spatial
dimension is closely related to understanding how binaural
information is processed in the auditory system. There is an
extensive and compelling body of evidence in support of the
important role that binaural cues provide in hearing out a
target source among masking sources. However, binaural
a) Portions of this work were presented at the 149th Meeting of the Acoustical
Society of America in Vancouver, BC, Canada, May 2005.
b) Electronic mail: gkidd@bu.edu
3804 J. Acoust. Soc. Am. 118(6), December 2005 © 2005 Acoustical Society of America. 0001-4966/2005/118(6)/3804/12/$22.50
cues may be used in different ways at different stages of
processing in the auditory system to produce a selective lis-
tening advantage. The most extensively studied binaural cues
improve the effective target-to-masker ratio (T/M) of the input from the auditory periphery to higher neural levels.
These include the “better-ear advantage” in which the spatial
separation of target and masker improves the acoustical T/M
in one ear relative to the case in which target and masker
emanate from the same location. Spatial separation of
sources also causes interaural differences which are pro-
cessed by neurons in the binaural portions of the ascending
auditory pathway to improve the effective T/M of the stimulus (cf. Durlach, 1972; Colburn, 1996; Colburn and Durlach,
1978). Binaural interaction is usually thought of as occurring
automatically (i.e., not under voluntary control) according to
the stimulus-driven properties of these neurons. The maximum advantage of spatial separation of a speech target and a
speech-shaped noise masker in a sound field is about
8–10 dB (larger advantages may be obtained for sources
very near the listener, e.g., Shinn-Cunningham et al., 2001)
and is roughly equally attributable to contributions of the
better-ear advantage and binaural interaction (Zurek, 1993;
see also Plomp, 1976; Bronkhorst and Plomp, 1988; Culling
et al., 2004).
When a speech target is masked by a noise, better-ear
listening and binaural interaction may almost completely ac-
count for the advantage afforded the listener by spatial sepa-
ration of sound sources. However, when the target is the
speech of one talker and the masker is the speech of another
talker(s), the problem is more complex and other factors
must be considered. First of all, perceptual segregation of a
human voice from a Gaussian noise is a trivial problem—
they differ in nearly every important way that might cause
them to be erroneously grouped together. Normally, a listener
has little difficulty distinguishing which object is noise and
which is speech and focusing attention on one or the other is
a simple matter. When the masker is another speech source,
however, the segregation task may be simple, but then again
it may not be, depending on how different the two talkers are
with respect to segregation cues such as fundamental fre-
quency, intonation patterns, envelope coherence across fre-
quency, timbre, etc. In such cases, segregating and directing
attention to the correct source may be difficult indeed. Furthermore, similar voices are easily confused and lead to errors in speech recognition even for clearly segregated
sources, particularly when the listener is uncertain about
which source is the target (e.g., Brungart et al., 2001; Arbogast et al., 2002). In selective listening tasks involving multiple talkers, it is often unclear whether the interference observed in target speech recognition is a result of masking,
failure to segregate the target, confusion and misdirected at-
tention, or some combination of factors.
In attempting to determine the role played by selective
attention, manipulating the expectation of the observer is often key. Greenberg and Larkin (1968) demonstrated that listeners exhibit a high degree of selectivity in the frequency
domain using the "probe-signal" method, in which the signal
(target) frequency had a much higher likelihood of occurrence than several surrounding probe frequencies. Although
both target and probe tones were equally detectable when
presented alone, detectability was higher in the mixed case
for the more likely target tone than for the less likely probe
tones, with performance (as a function of frequency) resembling the attenuation characteristics of a bandpass filter.
Since the initial report by Greenberg and Larkin (1968), the
technique has been used by many other investigators to demonstrate attentional tuning in frequency (e.g., MacMillan and
Schwartz, 1975; Scharf et al., 1987; Schlauch and Hafter,
1991; Green and McKeown, 2001); time (Wright and Dai,
1994); spectral shape (Hill et al., 1998); and modulation frequency (Wright and Dai, 1998).
Arbogast and Kidd (2000) found evidence for "tuning"
in spatial azimuth for both accuracy and response time mea-
sures in a probe-signal frequency pattern identification task,
but the effects were relatively small and occurred when the
acoustic environment was very complex and uncertain. In
fact, most of the recent work on spatial attention has used
simple stimuli and tasks such as detection of the presence of
a tone in quiet or in noise and thus does not bear a close
correspondence to the complex multitalker problem posed
early on by Cherry (1953).
Erickson et al. (2004) compared speech identification
performance for conditions in which the location (simulated
under headphones using head-related transfer functions,
HRTFs) of a target talker was chosen at random from among
four possible locations to conditions in which the location of
the talker was held constant. They measured identification
performance for a target talker in the presence of one to three
other talkers uttering similarly constructed sentences. For a
known target talker (same voice across trials in a block of
trials), the performance advantage obtained by providing a
fixed location was significant when either two or three com-
peting talkers were present. The size of the advantage was
nearly 20 percentage points for the two-masker condition.
Recently, Brungart and Simpson (2005) extended these findings to conditions where target location changed probabilistically across trials within a run. As the probability of a location transition increased, speech identification performance
decreased.
The results of the Erickson et al. (2004) and Brungart
and Simpson (2005) studies suggest that attending to a par-
ticular location along the spatial dimension, at least when the
listener knows the talker and/or has a priori knowledge
about the target sentence, can provide a significant advantage
in recognizing the speech of a target talker in the presence of
competing talkers. This effect appears to be principally due
to directed attention rather than a “better ear” advantage or
binaural analysis. An important factor in the Erickson et al.
and Brungart and Simpson studies, as well as the Arbogast
and Kidd (2000) study mentioned earlier, was the presence
of a high degree of uncertainty. Perhaps the role of spatial
focus of attention is revealed more readily when the listening
task is very demanding and produces a heavy processing
load on the observer.
The present study is similar to that of Erickson et al.
(2004), discussed above, but also has some important methodological differences. First, a condition was tested in which
there was no a priori knowledge provided to the listener
(within the context of the range of uncertainty in the experiment) about the target. In that condition, the callsign that
identifies the target sentence was only provided after the
stimulus. This manipulation was intended to produce a very
high load on both the attention and the memory of the lis-
tener and is essentially a divided attention task. Second, un-
certainty about target location was varied probabilistically
over a range of values in order to produce a function relating
performance to degree of uncertainty.
II. METHODS
A. Listeners
The listeners were four normal-hearing college students
ranging in age from 19 to 22 years. Listeners were paid for
their participation.
B. Stimuli
The stimuli were sentences from the Coordinate Response Measure (CRM) corpus (Bolia et al., 2000). The four
male talkers from the corpus were used. Sentences have the format: "Ready
[callsign] go to [color] [number] now." For each talker, the
corpus contains sentences with all possible combinations of
eight callsigns, four colors, and eight numbers.
C. Procedures
The data were collected in a 12 × 13 ft sound field enclosed by a single-walled IAC booth. The walls and ceiling
were perforated metal panels and the floor was carpeted. The
acoustic characteristics of this room are described in Kidd et
al. (2005; room condition "BARE"). The stimuli were presented via three Acoustic Research 215PS loudspeakers located 5 ft from the listener and positioned at 0° and ±60°,
where 0° is directly in front of the listener and +60° is to the
listener’s right. The height of the loudspeakers was approxi-
mately the same as the height of the listener’s head when
seated. These loudspeakers were calibrated and matched in
terms of overall level at the location of the listener’s head.
Each sentence was played through a separate channel of
Tucker-Davis Technologies hardware. Sentences were con-
verted at a rate of 40 kHz by a 16-bit, eight-channel D/A
converter (DA8), low-pass filtered at 20 kHz (FT6), attenuated (PA4), and passed through power amplifiers (Tascam)
that were connected to the three loudspeakers.
On each trial, three sentences were presented simulta-
neously, one to each of the three loudspeakers. Each sentence
was played at 60 dB SPL. The three sentences were ran-
domly chosen on each trial with the requirement that the
talkers, callsigns, colors, and numbers of the three sentences
were all mutually exclusive. One sentence of the three was
randomly designated as the target sentence by providing the
listener with its callsign while the other two were considered
maskers. The listener’s task was to identify the color and
number from the target sentence in a 4 × 8-alternative
forced-choice procedure. A handheld keypad/LCD display
(Q-term) was used to relay messages to the listener in the
booth and to register the listener’s responses. A warning on
the Q-term display preceded each trial. Data were collected
in blocks of 30 trials each. At the end of each block, percent-correct feedback was provided for that block. Listeners participated in the experiment in sessions of 1.5 to 2 h each,
including several breaks. The listeners' heads were not restrained, but they were instructed to face directly ahead (0°
azimuth) during stimulus presentation.
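The trial construction described in the preceding sections (three simultaneous CRM sentences whose talkers, callsigns, colors, and numbers are all mutually exclusive, with one sentence designated the target) can be sketched as follows. This is a minimal illustration, not the authors' code; the talker and callsign indices are placeholders, and the specific color names are an assumption about the CRM corpus.

```python
import random

COLORS = ["blue", "green", "red", "white"]  # assumed CRM color set
NUMBERS = list(range(1, 9))                 # the eight CRM numbers
TALKERS = list(range(4))                    # indices for the four male talkers
CALLSIGNS = list(range(8))                  # indices for the eight callsigns

def make_trial(rng=None):
    """Draw three sentences whose talkers, callsigns, colors, and
    numbers are all mutually exclusive, then designate one at random
    as the target (identified to the listener by its callsign)."""
    rng = rng or random.Random()
    talkers = rng.sample(TALKERS, 3)
    callsigns = rng.sample(CALLSIGNS, 3)
    colors = rng.sample(COLORS, 3)
    numbers = rng.sample(NUMBERS, 3)
    sentences = [
        {"talker": t, "callsign": c, "color": col, "number": n}
        for t, c, col, n in zip(talkers, callsigns, colors, numbers)
    ]
    target = rng.randrange(3)  # which of the three sentences is the target
    return sentences, target
```

Because `random.sample` draws without replacement within a trial, no two of the three sentences can share a talker, callsign, color, or number, matching the mutual-exclusivity constraint described above.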
There were two main variables in the experiment. First,
the callsign indicating the target sentence could be provided
to the listener (by visual display on the Q-term) either a
minimum of 1 s before ("callsign before") or immediately
after ("callsign after") stimulus presentation. In both cases
the callsign display remained on the screen until after the
listener’s response was recorded and response feedback was
provided. Second, the a priori probabilities associated with
target occurrence at each location were varied. When one
loudspeaker was more likely to be the source of the target the
probabilities tested were 1, 0.8, and 0.6 and the probabilities
assigned to both of the other two loudspeakers were 0, 0.1,
and 0.2, respectively. Each callsign by probability condition
was tested separately for each of the three locations. There
was also a condition in which the target source was equally
likely among all three locations (i.e., p = 1/3); this condition is referred to
as "random." The probabilities assigned to the three locations
were held constant across a block of 30 trials. The listener
was reminded of the probabilities associated with each loca-
tion at the start of every trial. The warning message that
preceded each stimulus presentation indicated the expected
percentage of trials for which the signal sentence would be
presented from each location. For example, “80-10-10” indi-
cated that the target sentence would be expected to be played
from −60° approximately 80% of the time and from the 0°
and +60° locations approximately 10% of the time each for
that block of trials. The sampling that determined target lo-
cation on any given trial was with replacement so that the
actual frequency of occurrence varied.
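The per-trial sampling just described (target location drawn with replacement from the block's a priori probabilities) can be sketched as below; this is an illustrative reconstruction, not the experiment's actual software.

```python
import random

def sample_block_locations(probs, n_trials=30, seed=None):
    """Sample a target location for each trial of a block, with
    replacement, from the a priori probabilities assigned to the
    -60, 0, and +60 degree loudspeakers."""
    rng = random.Random(seed)
    return rng.choices([-60, 0, 60], weights=probs, k=n_trials)

# Example: the "80-10-10" block described in the text.
block = sample_block_locations([0.8, 0.1, 0.1], n_trials=30)
# Because sampling is with replacement, the realized counts vary
# from block to block around the expected 24/3/3 split.
```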
The combination of these variables yielded 24 conditions (2 callsign conditions × 3 locations × 4 probabilities). Data were
collected in two-block pairs of the same condition. After every pair of blocks, the callsign condition (callsign before or
callsign after) was changed, with the initial callsign condition
chosen randomly for each listener. For every two blocks of
data for a given callsign before or callsign after condition,
the probability/location condition was chosen randomly
without replacement from among the 12 possible conditions
available (3 locations × 4 probabilities). A minimum of 16
blocks (480 trials) was collected for each callsign, location,
and probability condition. In the random probability condition, because there was no location subcondition, three times
as many blocks were collected, for a minimum of 48 blocks
(1440 trials), the same number as in the other conditions
when summed across location. Listeners were minimally
trained in the task with a single block of 30 trials in which
the target sentence was played alone at 0° azimuth.
III. RESULTS
A. Accuracy
Performance was specified as proportion correct identi-
fication where a response was counted correct only if both
the color and number of the target sentence were identified.
Chance performance was about 0.03 (1/32; 4 colors by 8
numbers). The four listeners were very similar in their performance; therefore, the results are displayed as group means
and standard deviations. Figure 1 shows proportion correct
identification (symbols) as a function of the probability of
occurrence of the target at a specific location for the callsign
before (circles) and callsign after (triangles) conditions. For
both callsign before and callsign after, performance declined
as target location uncertainty increased. For the callsign after
condition, performance decreased from a proportion correct
of about 0.91 when the target location was certain (i.e., p = 1) to about 0.31 when the target location was randomly
chosen among the three locations. For callsign before, proportion correct identification performance was about the
same as for callsign after when the location was certain
(0.92) and decreased to about 0.67 when the location of the
target was chosen at random. The dashed line at the bottom
indicates chance performance; the other two lines with no
symbols (dotted and dash-dot) will be discussed later.
In order to determine whether the trends apparent in Fig.
1 were statistically significant, the data were transformed
into arcsine units and then submitted to a repeated-measures
ANOVA with three within-subjects factors: callsign, location
(−60°, 0°, +60°), and probability of occurrence at a given
location. All three main factors were significant: callsign
[F(1,3) = 59.3, p < 0.01], location [F(2,6) = 75.5, p < 0.001],
and probability [F(3,9) = 998.1, p < 0.001]. In addition, the
interaction of callsign and probability was significant
[F(3,9) = 95.5, p < 0.001], as were the interactions of location
and probability [F(6,18) = 9.95, p < 0.001] and callsign and
location [F(2,6) = 6.2, p < 0.05]. The three-way interaction
was not significant [p > 0.05].
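The text states only that proportions were transformed to arcsine units before the ANOVA, without giving the exact form. A common variance-stabilizing choice for proportions, shown here purely as an assumption, is 2·arcsin(√p):

```python
import math

def arcsine_units(p):
    """Variance-stabilizing arcsine transform for a proportion p in
    [0, 1]. The paper does not specify its exact form; this is the
    standard 2*arcsin(sqrt(p)) convention."""
    return 2.0 * math.asin(math.sqrt(p))

# e.g., transform condition means before submitting them to an ANOVA
scores = [0.92, 0.67, 0.31]
transformed = [arcsine_units(p) for p in scores]
```

The transform spreads out proportions near 0 and 1, where raw-proportion variances are compressed, which is why it is routinely applied before ANOVA on accuracy data.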
Knowing the callsign in advance potentially allowed the
listener to identify the target voice early in the stimulus and
either find and focus on the location of the target talker or
simply follow the voice of the target talker until the test
items occurred, or both. However, it appears that knowing
the callsign beforehand without knowledge about location
was less useful as a cue than the converse. The proportion
correct identification for the condition in which the callsign was
certain (callsign before) paired with uncertain location (p = 0.33, random) was on average about 0.67, whereas the certain-location condition (p = 1) paired with uncertain callsign
(callsign after) yielded about 0.91. When p = 1 and the callsign
was given in advance, no additional improvement was noted
(0.92). The callsign-before random-location condition resulted in roughly the same performance as the callsign-after
p = 0.8 condition.
Not only are the main effects of callsign before versus
after and of probability of occurrence apparent in Fig. 1, but the
significant interaction between the two factors is also obvious. There was essentially no difference between callsign
after and callsign before at p = 1, but the difference between
the two increased systematically as uncertainty increased until, in the random condition, the proportion correct in the
callsign before condition was about 0.36 higher than in the
callsign after condition. The main effect of location noted
above was significant because, overall, presenting the target
from the location of +60° (to the right) resulted in a higher
proportion correct (0.76) than for the locations of −60°
(0.65) or 0° (0.66). However, as the significant interaction
between location and probability of occurrence indicates, the
effect of location also depended on uncertainty and was
mainly due to the random condition. This effect may be seen
in Fig. 2, which displays proportion correct identification as
a function of the target location uncertainty. The circles are
FIG. 1. Group mean proportion correct identification scores, and standard
deviations of the means, plotted as a function of increasing uncertainty
about target location (a priori probability of occurrence). The circles represent performance in the "callsign before" condition and triangles indicate
performance in the "callsign after" condition (see text). The dashed line at
the bottom represents chance performance. The lines near the data points
indicate the performance predicted by a simple single-source listener strategy for the callsign before (dash-dot) and callsign after (dot) conditions.
FIG. 2. Group mean proportion correct identification scores, and standard deviations of the means, subdivided according to the location from which the target
was presented.
for the callsign before condition and the triangles are for the
callsign after condition. Each panel is for a different location.
In the p = 1 case, no additional data sorting was required.
However, for the p < 1 cases the data were sorted according
to actual location (that is, the proportion correct is for all
trials in which the target was presented from that particular
location, as opposed to proportion correct for all trials in the
nominal location condition). The error bars are ±1 standard
deviation of the mean. The effect of location increased with
increasing location uncertainty, with the greatest difference in
performance, as noted above, in the two random conditions,
where proportion correct identification was about 0.3 higher
for the target at +60° than at −60°.
It is also of interest to determine how well listeners per-
formed when the target actually occurred at the expected
location versus when it occurred at an unexpected location.
This can be examined in the two intermediate probability
conditions. It might be expected, for example, that the listen-
ers always focused attention at the more likely location. In
that case, performance should equal that found in the p = 1
condition (i.e., about 0.92 correct) on the trials when the
target was presented from that location. Figure 3 displays
proportion correct performance as a function of expected and
unexpected target locations. Expected location means that
the target sentence was played at the most likely location,
while unexpected location means that the target sentence was
played at one of the other two less likely locations. Circles
are for the callsign before condition and triangles are for the
callsign after condition. Unfilled symbols indicate target pre-
sentation at the expected location and filled symbols are for
target presentation at unexpected locations. Data are means
and standard deviations computed across listeners for all locations. The results from p = 1 and p = 0.33 are included here
for comparison (same as in Fig. 1), although there is no "unexpected" case for p = 1 or "expected" case for p = 0.33. The
horizontal lines will be discussed later.
When the target was presented at the expected location,
proportion correct identification performance was 0.8 or
greater in all conditions. In that case, the callsign condition
did not matter and the degree of uncertainty had a relatively
minor effect, decreasing from around 0.92 for p = 1 to around
0.80 for p = 0.6. The decline in observed performance with
decreasing p undoubtedly reflects a cost associated with the
greater uncertainty about location and could indicate some
attempt by the listeners to increasingly divide or distribute
attention among locations. However, the small effect sug-
gests that this had only a minor influence on performance for
trials at the expected location.
In contrast to the results obtained when the target was
presented at the expected location, target presentation from
an unexpected location led to much poorer performance with
large differences observed between callsign before and call-
sign after conditions. For callsign before, performance actually improved as uncertainty increased, from a proportion correct of about 0.43 for p = 0.8 to about 0.67 for p = 0.33. The
improvement in performance with increasing target-location
uncertainty is a reasonable outcome if one assumes that there
is a substantial penalty associated with attending to the
wrong location.
Perhaps the most striking result shown in Fig. 3 is that,
for the callsign after condition and p = 0.8 or 0.6, listeners
were almost never correct (0.02 and 0.05, respectively) when
the target occurred at an unexpected location. Even the
knowledge that the target would occur in the more likely
location only 60% of the time still did not improve perfor-
mance for the unexpected locations. Therefore, for callsign
after, the listeners appeared to focus attention almost entirely
at the expected location.
For the callsign before p = 0.8 and 0.6 conditions, listeners were correct nearly half the time (0.43 and 0.53, respectively) when the location was unexpected. This implies that
listeners used a combination of expected location and target
callsign to perform the task. They probably were not using
target callsign alone because performance for expected loca-
tions was significantly better than for unexpected locations.
Obviously, they did not use location alone either because
they were correct for unexpected locations fairly often.
B. Predictions of a simple single-source listener
strategy
There are a number of strategies potentially available to
the listener in attempting to solve this task. It is not possible
from the data obtained in the current experiments, however,
to evaluate and decide among all of the alternative strategies
or observer models. On the other hand, it can be very useful
and informative to take a single, simple strategy and follow
through its assumptions and predictions. In this section, we
consider the predictions of one such strategy and make com-
parisons to the results described above.
There are three initial assumptions that define this lis-
tener strategy. First, it is assumed that the sources are per-
fectly segregated and that errors occur because attention is
directed to the incorrect source. It is already known, how-
ever, from the results discussed above, that performance was
not perfect even for the p = 1 case. Accordingly, the predictions of listener performance that follow are scaled by a multiplier of 0.92 (the observed proportion correct in the certain
FIG. 3. Group mean proportion correct identification scores (ordinate), and
standard deviations of the means, subdivided according to whether the target
was presented at the expected (more likely; open symbols) or an unexpected
(less likely; filled symbols) location. The abscissa is target location uncertainty. The circles represent performance for the callsign before condition
while the triangles represent performance for the callsign after condition.
The horizontal lines not connecting data points are the predictions of the
single-source listener strategy discussed in the text.
location case) to reflect some (unspecified) limitations on the
ability to identify targets at attended locations. Second, it is
assumed that the listener attends to only one source at any
given moment in time. While this is a strong assumption that
excludes models based on divided attention, it also is
straightforward to evaluate and provides a means for deter-
mining whether interpretations based on divided attention are
necessary. The results described above for targets occurring
at unexpected locations for the callsign after conditions pro-
vide some degree of support for this assumption. Third, it is
assumed that the listener always attends to the more likely
location. In the case of callsign before presentation, this pro-
vides an opportunity for the listener to switch attention to
another location if it is determined that that location does not
contain the target. This would happen whenever a nontarget
callsign is presented from the location initially attended, i.e.,
the expected location. In those instances, it is assumed that
the listener randomly chooses to focus on one of the other
two sources because at the time switching occurs the other
callsigns would have already been presented and are as-
sumed not to be useful to the listener.
In determining the performance of this hypothetical lis-
tener, certain conditional probabilities may be defined. First,
for callsign after, performance should simply be the probability of occurrence scaled by the p = 1 value of 0.92. So, the predicted performance for callsign after is

PCa = p × PCmax,   (1)

where PCa is the predicted proportion correct for the callsign after condition, p is the a priori probability of occurrence at one location, and PCmax is the highest proportion correct possible for an attended location (based on the p = 1 results). The predictions of listener performance from this equation are shown as the dotted line in Fig. 1.
The predictions for the callsign before condition include PCa as a term, but also include a term representing the increase in performance expected by switching attention after determining that the attended location is not the target location. In that case

PCb = PCa + 0.5[(1 − p) PCmax],   (2)

where PCb is the predicted performance for the callsign before condition. The predictions for performance based on this equation are also shown in Fig. 1 as a dash-dotted line. As a first approximation, this simple strategy accounts for the group-mean accuracy results fairly well (comparison of group mean data and lines in Fig. 1).
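As a check, Eqs. (1) and (2) can be evaluated directly. The following is a minimal sketch (not part of the original study) using only the quantities quoted in the text, with PCmax = 0.92 taken from the p = 1 results:

```python
# Single-source listener strategy predictions, using PCmax = 0.92
# (the identification score measured when target location was certain).
PC_MAX = 0.92

def pc_after(p, pc_max=PC_MAX):
    """Eq. (1): callsign after -- attend only the most likely location."""
    return p * pc_max

def pc_before(p, pc_max=PC_MAX):
    """Eq. (2): callsign before -- switch to one of the two remaining
    sources when a nontarget callsign is heard at the attended location."""
    return pc_after(p, pc_max) + 0.5 * (1 - p) * pc_max

for p in (1.0, 0.8, 0.6, 1 / 3):  # p = 1/3 corresponds to the random condition
    print(f"p = {p:.2f}: after = {pc_after(p):.2f}, before = {pc_before(p):.2f}")
```

For the random condition (p = 1/3) this gives about 0.31 for callsign after, matching the value discussed in the text.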
It is also possible to compare the predictions associated
with this strategy to the results shown in Fig. 3, where the
listener responses are computed according to target presen-
tation at expected and unexpected locations for the target
probabilities of 0.8 and 0.6. The predictions of this listener
strategy are straightforward. Based on the assumption that
the listener attends to the most likely location, performance
should equal PCmax whenever the target is actually presented
from that location for both the callsign before and callsign
after conditions. This prediction is shown on Fig. 3 as horizontal lines (dash-dotted and dotted for the two callsign conditions), both at a proportion correct value of 0.92. As noted
earlier, performance is slightly below the prediction, possibly
reflecting a cost associated with target location uncertainty.
When the target occurs at unexpected locations, the predic-
tions of this listener strategy differ markedly for callsign af-
ter and callsign before conditions. For callsign after, the only
information available to the listener is from the masker at the
expected location, so optimal performance would be to
choose a color/number combination from among the remain-
ing alternatives after eliminating the ones known to be inac-
curate. Thus, performance should be at around 0.05 propor-
tion correct 共3 colors ⫻7 numbers, with a small correction
for guessing on 1-PCmax proportion of the trials, shown by
the horizontal dotted line兲. In fact, the observed performance
is close to that prediction. In the random condition, we as-
sume the listener simply attends to one arbitrarily chosen
location and thus should obtain a proportion correct of p times PCmax, or about 0.31, which, as noted above in Fig. 1, is quite close to the data point.
For the callsign before conditions, once the masker call-
sign is heard from the expected location, the hypothetical
listener switches the focus of attention to one of the other
locations. The choice would be arbitrary because it is as-
sumed that only the callsign from the attended—and
incorrect—location was processed, so the listener would
have a 0.5 probability of selecting the correct source. Thus,
performance should be equal to PCmax times 0.5, or about
0.46 proportion correct (shown as the dash-dotted line at that
value兲. Inspection of Fig. 3 suggests that, again, this listener
strategy predicts performance reasonably well with the ob-
tained performance for callsign before at unexpected loca-
tions near the predicted values.
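The two unexpected-location predictions can also be computed directly. In the sketch below (an illustration, not the authors' code), the small guessing correction for callsign after is approximated as chance over all 32 color/number alternatives on the 1 − PCmax proportion of trials; the exact form of that correction is an assumption, since the text does not spell it out:

```python
# Predictions when the target occurs at an UNEXPECTED location.
PC_MAX = 0.92

# Callsign after: eliminate the attended masker's color and number,
# then guess among the 3 x 7 = 21 remaining combinations; on the
# remaining 1 - PCmax of trials, assume chance over all 32 alternatives
# (this correction term is an assumption, not specified in the text).
pc_after_unexpected = PC_MAX * (1 / 21) + (1 - PC_MAX) * (1 / 32)
print(f"callsign after, unexpected location:  {pc_after_unexpected:.3f}")

# Callsign before: switch away from the expected location and pick one
# of the two remaining sources at random.
pc_before_unexpected = 0.5 * PC_MAX
print(f"callsign before, unexpected location: {pc_before_unexpected:.2f}")
```

The first value comes out near the 0.05 quoted in the text, and the second is exactly the 0.46 quoted for the switching strategy.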
Overall, this simple listener strategy was fairly success-
ful at predicting the proportion correct obtained from actual
listeners in most conditions. However, there were some ef-
fects that it cannot capture, such as the difference between
locations revealed in Fig. 2. Also, the results in Fig. 3 indi-
cate small but understandable differences between the 0.8
and 0.6 probability conditions that are not accounted for in
this simplified strategy. The results from the 0.6 condition
suggest a slightly greater tendency to distribute attention
with lower scores at the expected locations and higher scores
at unexpected locations.
C. Error analysis
It is also informative to examine the errors in identifica-
tion made by listeners as uncertainty varied. This analysis is
of interest both for attempting to understand actual listener
performance and for evaluating the listener strategy consid-
ered above which makes strong predictions about the types
of errors that should occur. Four broad categories of possible errors should be considered. First, the listener could
confuse the target and a masker so that the color and number
that are reported correspond to those of one of the two
maskers. That type of error could be considered misdirection
of attention due to focusing on the wrong source. This would
form the great majority of errors predicted by the listener
strategy discussed above. A second type of error would be
due to random guessing from among the alternatives or per-
haps guessing after eliminating one color and one number as
hypothesized above for some callsign after conditions. Errors
from random guessing might occur if performance were lim-
ited by energetic masking, where the test words (colors and numbers) were obscured and guessing was the only option
available. Third, the errors could take the form of mixing
colors and numbers from among the three sources, either one
word from the target and one from a masker or one each
from the two different maskers. This type of error might
occur due to a breakdown in stream segregation. That is, the
words are available but the listener is unable to properly
connect the callsign, color, and number. And, finally, the er-
rors could contain words not presented during the trial. This
type of error could result in a mixture of words including one
target or masker word and one word not presented on the
trial (i.e., guessed from among the other alternatives). This
type of error is not as easily interpreted as the other three
types because there is a fairly high chance that it could occur
due to random guessing (9 of the 32 alternatives are presented on each trial). The frequency of occurrence of each of
these error types can be determined by analyzing the errors
found in the experiment.
Figure 4 shows the proportions of incorrect responses
obtained in the experiment subdivided according to the three
main types of errors that were made. These data are plotted
as a function of location uncertainty, and, because the pat-
terns of errors were very similar across listeners, are shown
as group means. The different types of errors are indicated by
the black, gray, and white portions of the stacked bars. First,
the overall height of the bars represents the proportion of all
errors in which both of the responses (color and number)
matched any of the six key words spoken on a given trial.
For p = 1, this accounted for a proportion of the errors equal to 0.87, while for the other values of p this accounted for proportions between 0.93 and 0.97 regardless of callsign
condition. What this means is that there was very little guess-
ing that occurred in any of these conditions—the key words
were confused or mixed together but they appear to have
been available to the listener. By way of comparison, the
expected proportion of errors due to random guessing in
which both responses were from the words spoken on a
given trial is 0.26 (8 of the 9 possible combinations of words presented on a trial are errors, out of 31 total possible error combinations). The obtained proportion of errors of this type overall was 0.94 (the same as the average of the total heights of bars in Fig. 4). These findings, along with the p = 1 results, suggest that very little energetic masking was present in any of these conditions and support the conclusion that the second type of error discussed above (random guessing) had a negligible effect on performance.
The different shadings indicate a finer-grain analysis of
these errors. The black lower portions of the bars represent
the proportion of errors in which one of the words reported
was from the target sentence and the other word was from
one of the two masker sentences. This type of error was by
far the most common for p = 1 (keeping in mind that there were very few errors overall in this condition) but was less
frequent for the other values of p. The occurrence of this
error would be consistent with a breakdown in the process of
speech stream segregation: the target and masker sentences
were not held separate, but were mixed. To the extent that
stream segregation involves perceptually connecting a se-
quence of sound elements that belong together, these “mix-
ing” errors reveal a failure of that process and dominated the
p = 1 condition (0.75 and 0.72 proportion of errors for callsign before and after, respectively). It also accounted for substantial proportions of errors (ranging from about 0.18 to 0.38) in the more uncertain location conditions. For these p < 1 conditions, the proportion of errors of this type was somewhat higher for callsign before than callsign after.
The intermediate gray bars indicate the proportions of
errors in which both key words reported corresponded to the
key words from one of the two masker sentences. This was
the first category of error discussed above. When the listener
was certain about where to listen (p = 1), confusions with masker talkers rarely occurred (0.10 to 0.14 proportion of errors). For the conditions containing location uncertainty, however, masker confusions formed the most common type of error (with overall proportions of errors ranging from 0.50 to 0.69, and higher proportions for callsign after than for callsign before). This could occur if the spatial focus of attention was directed to the wrong location and the listener
reported the color and number from that location. Finally, the
white bars at the top indicate the proportions of errors when
one word was from one masker and the other word was from
the other masker. That type of error was infrequent in all
conditions, did not differ with callsign knowledge, and was
least frequent for the p = 1 case.
The remaining errors not plotted in Fig. 4 are cases in
which at least one word reported did not correspond to any
of the words spoken (one target word and one unspoken word; one masker word and one unspoken word; both words unspoken). The probability from random guessing for one word spoken and one word unspoken is 0.58 (18 of 31), whereas the obtained proportion of errors of that type was 0.055. The expected proportion of errors due to guessing when both words reported were unspoken would be 0.16 (5 of 31), but the obtained proportion was only 0.005.
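The chance proportions quoted here and earlier (0.26, 0.58, and 0.16) follow from counting over the CRM response set. The sketch below assumes 4 colors and 8 numbers (consistent with the 32 alternatives mentioned earlier) and that the three talkers use distinct colors and distinct numbers:

```python
# Chance proportions for the error categories, counted over the CRM
# response set: 4 colors x 8 numbers = 32 color/number combinations,
# of which 3 x 3 = 9 use only words spoken on a trial. One combination
# is correct, leaving 31 possible error responses.
total_errors = 32 - 1                      # 31 error combinations
both_spoken = 3 * 3 - 1                    # 8: both words spoken, not the target pair
one_unspoken = 3 * (8 - 3) + (4 - 3) * 3   # 18: exactly one reported word unspoken
both_unspoken = (4 - 3) * (8 - 3)          # 5: neither reported word was spoken
assert both_spoken + one_unspoken + both_unspoken == total_errors

for name, n in [("both words spoken", both_spoken),
                ("one word unspoken", one_unspoken),
                ("both words unspoken", both_unspoken)]:
    print(f"{name}: {n}/{total_errors} = {n / total_errors:.2f}")
```

This reproduces the 8/31 = 0.26, 18/31 = 0.58, and 5/31 = 0.16 figures used in the error analysis.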
FIG. 4. This bar graph indicates the proportions of various types of errors in
the speech identification task. The ordinate is the proportion of errors and
the abscissa is the degree of target location uncertainty. The different shad-
ings designate the types of errors (see legend and text).
As discussed above, the most common type of error for
the p⬍1 conditions was a masker confusion error. To what
extent does this error reflect focusing attention at the wrong
location? An analysis of errors that is relevant to this issue
involves determining the extent to which incorrect responses
in uncertain location conditions corresponded to the masker
sentence that was presented at the expected target location.
That is, the listener knew that the target was more likely at
one particular location and incorrectly reported the (masker)
color and number from that location. This analysis is only
possible for the 0.8 and 0.6 probability conditions. Figure 5
shows the proportion of incorrect responses that matched the
color and number from the masker sentence presented at the
expected location (when the target was actually presented at an unexpected location). The data are intersubject means and standard deviations (combined over the three locations). The
black bars are for callsign before and the white bars are for
the callsign after condition. In the callsign before condition,
the proportion of errors corresponding to the masker sen-
tence presented at the expected target location was between
0.4 and 0.5. For the callsign after condition, the correspond-
ing proportions of errors were much higher (0.74 and 0.81).
These data, along with those presented in Fig. 4, are
inconsistent with the simple listener strategy described in
Sec. III B above. Although that strategy was generally suc-
cessful in accounting for the group-mean accuracy results in
the various conditions, the patterns of errors obtained were
not consistent with that strategy. The listener strategy exam-
ined above predicts that all of the errors (except perhaps those errors equal to 1 − PCmax) should be confusions with one
of the two maskers. That prediction was not supported by the
results. The proportions of errors for p < 1 cases that were not from one specific masker ranged from about 0.3–0.5 and increased to almost 0.9 when p = 1. Thus, a substantial pro-
portion of errors occurred that could not be attributed to con-
fusing a masker with the target or simply reporting the sen-
tence from the wrong location. Furthermore, the callsign
before results shown in Fig. 5 are incompatible with the pre-
dictions of a switching strategy because nearly one-half of
the errors corresponded to reporting a masker presented at
the expected location. While these proportions are smaller
than those seen in the callsign after condition in which this
strategy is not possible, the switching strategy predicts that
none of the errors should correspond to that location. This
suggests a fairly strong tendency to report the content from
the expected location and not shift attention away from that
location during stimulus presentation. For callsign before,
that result is surprising because the listener should realize
that the expected location does not contain the target as soon
as the inaccurate callsign is heard. We do not have a good
explanation for this nonoptimal listener behavior and can
only speculate that the results reflect a strong bias in favor of
a priori information regarding location. The tendency to rely
on expected target location is even more strongly apparent
for the callsign after condition in which the proportions of
errors from the expected location were greater than 0.74. In
that case, if it were assumed that only the callsign, color, and
number from one location—the attended location—were re-
membered, then a more effective strategy would be simply to
guess from among the color and number alternatives after
excluding those from that masker. In that case, again, none of
the errors would correspond to the most likely target loca-
tion. The results in Fig. 5 reveal that listeners tended not to
adopt that strategy.
IV. DISCUSSION
The first point to be made concerns the degree to which
the findings here answer the question posed in the Introduc-
tion regarding the role of spatial focus of attention in solving
the cocktail party problem. The results of the current study
support the conclusion that focusing attention toward a target
sound source in the presence of spatially distributed maskers
can provide a very significant advantage in speech identifi-
cation performance. However, this prominent role of spatial
focus of attention may depend on other factors such as the
complexity of the listening environment and the processing
load placed on the observer.
In order to conclude that the listening advantages found
in this study may be attributed to spatial focus of attention,
we must consider the possible role of other factors such as
masking and perceptual segregation of sounds. In contrast to
studies in which the spatial separation between sources was
varied in an attempt to reveal listening advantages due to
binaural interaction and better ear listening, the spatial loca-
tion of the sources in the present study did not vary across
conditions. Thus, the binaural cues of interaural time and
level differences allowed the listener to locate the sources
within the environment, but did not differ across test condi-
tions and whatever amount of masking was present did not
influence the main variables of interest. The only case where
acoustic differences may have played a role is in the comparison of performance across fixed locations. For example,
an acoustically “better ear” would occur when the target was
presented through either the left or right loudspeaker com-
pared to the center loudspeaker. As shown in Fig. 2, some
differences in performance were observed favoring the right
location in the random condition. However, this is not likely
to be an indication of binaural masking release because it
was not present consistently across conditions nor was it
FIG. 5. The proportions of incorrect responses that matched the color and number of the masker sentence presented at the expected (more likely) location when the target was presented at an unexpected (less likely) location. The values plotted are group means and standard deviations for the callsign before (black bars) and callsign after (white bars) conditions.
present for left location presentation, which would be ex-
pected to produce roughly the same binaural cues as the
symmetrically placed right location.
It is not clear from Fig. 2 whether the difference in per-
formance across loudspeaker locations reflects a genuine per-
ceptual effect or a bias on the part of the listeners to attend to
the location on the right.2 There is a long history of study of
“right-ear advantages” in dichotic listening and, although the
effect has been attributed primarily to differences in the pro-
cessing of speech in the two hemispheres of the brain (e.g., Kimura, 1967; Zatorre, 1989), there is also evidence that the effect is labile and susceptible to attentional focus (e.g., Asbjornsen and Hugdahl, 1995; Asbjornsen and Bryden, 1996).
A further analysis of the data was undertaken to help under-
stand this effect in the current experiment. The results of this
analysis are shown in Fig. 6. This figure shows the propor-
tions of all words reported from the more likely location
regardless of whether the words were from targets (lower black portions of bars) or from maskers (upper white portions of bars). Note that for the random condition, there was
no “more likely location” so the bars represent all responses
to words from each location.
Inspection of Fig. 6 indicates that a strong asymmetry
between right- and left-side locations is only clearly evident
in the random condition. In that condition, not only was there
a tendency to report target words more often when they were
presented from the right side, but there was also a tendency
to report masker words more often when presented from the
right side (primarily for the callsign after condition, but in both callsign conditions both target and masker words were reported less often when located to the left). Thus, the words
presented from the right loudspeaker were chosen more often
than those from (in particular) the left loudspeaker regardless
of whether they were correct or incorrect. This response pat-
tern clearly reflects bias on the part of the listeners because
the expected probability of occurrence from the three loca-
tions in the random condition was equal. This bias largely
disappears when uncertainty about target location is de-
creased. Whether there is also a genuine processing differ-
ence, and whether that is in some way related to the observed
bias, cannot be ascertained from these data. Another relevant
point is that, because the stimuli were presented via loud-
speakers, the speech from all three talkers was present in
both ears on every trial. Thus, the differences in performance
must be related to the differences in acoustics as target posi-
tion varied. The acoustic differences between target and
masker talkers, though, are much less than in dichotic listen-
ing tasks where the stimuli are presented separately to the
two ears. It is not clear how the effect found here is related to
these acoustic differences. It should be noted that both Bolia
et al. (2001) and Brungart and Simpson (2005) have reported
right hemifield identification performance advantages using
the CRM test for earphone-based stimuli processed by
HRTFs.
The analysis presented in Fig. 6 also provides insight
into another issue. There is an extensive literature concerning
the tendency of subjects to adjust their response strategy to
match the a priori probabilities of different response alterna-
tives being correct, or having different payoffs, even if that
strategy is nonoptimal (called "probability matching"; e.g., Shanks et al., 2002; West and Stanovich, 2003; also, review by Vulkan, 2000). In the current experiments, evidence for
probability matching might be found by comparing the pro-
portions of responses to stimuli from expected locations to
the a priori probabilities of target presentation from those
locations. For callsign before, there is indeed a reasonably
close correspondence between overall response rates and the
assigned probabilities (Fig. 6). However, because the callsign
was provided in addition to location probabilities, it cannot
be determined if this correspondence truly reflects matching
responses to probabilities of occurrence at the different loca-
tions or is a by-product of a combined callsign-location
weighting strategy. Furthermore, the opportunity for switch-
ing the attended location based on callsign also could pro-
duce the proportions of responses shown in Fig. 6. For call-
sign after for p = 0.8 and, especially, p = 0.6, little support for
a probability matching interpretation is apparent. The ob-
tained proportions of responses from the expected locations
for those two conditions are greater than 0.9 and 0.8, respec-
tively, which are substantially above the corresponding prob-
abilities of occurrence.
The finding that identification performance was so accu-
rate when p = 1 provides support for the conclusion that, gen-
erally, masking was minimal and the sound sources could be
FIG. 6. The proportion of all words (single words or pairs) reported from the more likely location. This includes both words from the target (correct responses—lower black portion of bars) and from a masker (incorrect responses—upper white portion of bars). In each panel both callsign before and after responses are shown at each loudspeaker location. In the random condition (right-most panel) these are just proportions of responses by location because there was no "more likely" location.
successfully segregated. This conclusion is based on the as-
sumption that both factors are necessary for successful iden-
tification to occur. However, errors attributed to a breakdown
in the perceptual segregation of the speech streams were
present in all conditions. Although these “mixing errors”
were the most common type of error for the certain-location
conditions, the errors were still very infrequent with propor-
tion correct scores over 0.9 found for both callsign condi-
tions. The success of the listeners in those conditions indi-
cates that the speech streams can be segregated, but the
maintenance of separate streams is vulnerable to increasing
uncertainty about target location. Here, performance suffered with increased uncertainty as attention was more often focused at the wrong location, and errors consistent both with misdirected attention and with loss of stream segregation were found.
Furthermore, because identification performance was not
perfect even when the most a priori information was pro-
vided to the listener (about 0.92 proportion correct for the callsign before, p = 1 condition), some interference caused by nontarget sound sources clearly was present (proportion correct identification performance for a single-talker baseline condition was nearly perfect). Thus, the two more common
types of errors were confusions between target and masker,
which we believe indicate misdirection of attention, and mix-
ing target and masker words, which probably indicates a
breakdown in maintaining speech stream segregation.
The present results also imply that surprisingly little
shifting or redirecting of attention occurred during trials.
When the callsign was provided before trials, but the target
occurred at an unexpected location (p = 0.6 and 0.8), nearly
one-half of the errors corresponded to the masker at the tar-
get’s most likely location. These conditions indicate a rivalry
of cues: target talker versus location. In this case, directing
attention to the wrong location clearly diminished the ability
of the listener to follow the speech of the target talker from
the callsign to the test words. These scores are worse, in fact,
than in the random case where proportion correct perfor-
mance was about 0.7. Thus, when the target talker was indi-
cated beforehand there was a penalty of directing attention to
the wrong location.
Comparison of performance in the callsign before and
callsign after conditions overall indicates the interaction be-
tween the processing load imposed by the task and the im-
portance of spatial focus of attention. Ideally, if an observer
were capable of remembering all three simultaneous sen-
tences, specifically, associating and remembering the three
key words from each source—the callsign, color, and num-
ber, then spatial information would be unnecessary and per-
formance in both callsign before and callsign after conditions
would be equivalent. Clearly, this was not possible when
there were three simultaneous talkers. According to the “load
theory of attention" proposed by Lavie et al. (2004), when
observers are faced with a very demanding perceptual task—
segregating one element of an array of similar elements, for
example—the nontarget elements produce little interference
in the selection and processing of the target. This is because
of the assumption that all available perceptual resources are
occupied at any given point in time. If the resources required
to process the target occupy the entire pool of resources
available, none are left to allocate to distracting sources, so
little or no interference occurs. However, if the perceptual
task is not demanding but the subsequent cognitive control
load is, as would be the case if a high load were placed on
working memory, much greater interference from distracters
is predicted. In the current conditions, despite the fact that
the three sources are male talkers, the segregation task is
relatively easy. This conclusion is based on the high identi-
fication scores in the certain-location conditions. However,
particularly in the callsign after case, the cognitive load—the demands on working memory that would be sufficient to solve the task—is very high. Furthermore, when uncer-
tainty about location increased, errors in perceptual
segregation—mixing errors—increased as well (note that here we are referring to the combination of number of errors as in Fig. 1 and the error type as in Fig. 4). If we assume that
increasing uncertainty mainly affects cognitive load, then in-
creasing errors from distracters—either misdirected attention
or loss of stream segregation—would be expected. Thus, al-
though the evidence in support of the load theory usually
takes the form of response times (as do most data concerned with auditory attention, with some exceptions, cf. Sach et al., 2000; Erickson et al., 2004), our results appear to be quali-
tatively consistent with that theory.
We do not have data for other numbers of talkers (except for the control case of one talker) but speculate that equally
strong spatial effects could be observed for four or more
talkers. This is because the three-talker condition has already
degraded performance to near the limiting case imposed by
the number of potential target sources. Assuming that each
source may be segregated from the others and that only one
message can be remembered, then performance in the call-
sign after-random condition should at best be equal to the
reciprocal of the number of sources, if each is equally likely
to be the target. That is, the listener chooses one to attend to
and does so perfectly and completely to the exclusion of the
others. For callsign after, performance in the random condi-
tion was near that which would be expected based on this
strategy. An interesting case, then, is for two simultaneous
sources. At present, we have some indications that perfor-
mance in the two-talker callsign after condition may exceed
the reciprocal of the number-of-sources limit (Gallun et al., 2005).
Recently, the two- vs three-source comparison for mul-
titalker speech identification has been considered by Brun-
gart and Simpson (2002). In their study, the task was the
same as here—reporting the color and number for a desig-
nated callsign in the CRM task—and there were either one or
two other talkers also uttering sentences from the CRM task.
However, the stimuli were presented via earphones. When
the target, which was presented to the “ipsilateral” ear, was
mixed with one masker in the same ear, performance varied
predictably according to the target-to-masker ratio (T/M). When one masker was pre-
sented to the contralateral ear, no interference in target iden-
tification was observed; i.e., identification performance ap-
proached 100% correct. When both maskers—one ipsilateral
and one contralateral—were presented, target identification
performance was as much as 40 percentage points poorer
than the single (ipsilateral) masker condition. Brungart and
Simpson (2002) attributed this large decline in performance
to the much greater difficulty in ignoring two sources, rather
than one source, when there was a difficult segregation task
to perform in the target ear. Kidd et al. (2003) found a par-
allel effect using complex nonspeech stimuli. The greater
difficulty in three-talker conditions than in two-talker condi-
tions has been noted by Yost et al. (1996) and Hawley et al. (2004). In the current study, we speculate that the high pro-
cessing load caused by three simultaneous talkers contrib-
uted to the large beneficial effect of a priori information,
especially that of the most likely target location.
V. SUMMARY AND CONCLUSIONS
A priori knowledge about where to direct attention provided significant advantages in speech identification in highly uncertain multitalker listening conditions. This result was found for both cued and uncued target sentence presentation. The differences in performance found across the various conditions could not be attributed to binaural analysis, masking, or perceptual segregation, but rather appear to be uniquely related to focus of attention. The pattern of errors suggested a high degree of spatial selectivity with very little processing of speech originating from unattended locations. A simple single-source listener strategy was found to predict accuracy fairly well in most conditions but failed to account for the observed patterns of errors. The current results support the view that spatial focus of attention can be a very important factor in complex and uncertain multisource listening environments and may play a crucial role in solving the “cocktail party problem.”
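
The single-source listener strategy referred to above can be sketched as a weighted sum of two performance levels (a simplified illustration only; the symbols p, P_att, and P_chance are introduced here and do not appear in the article):

```latex
% Sketch: predicted proportion correct for a listener who attends
% exactly one source location per trial. The attended location
% contains the target with probability p; identification then
% succeeds with probability P_att, and otherwise only at roughly
% the chance rate P_chance for the response set.
\[
  P_c \approx p\,P_{\mathrm{att}} + (1 - p)\,P_{\mathrm{chance}}
\]
% With three equally likely target locations and no a priori
% information, p = 1/3; with certain target location, p = 1 and
% P_c approaches P_att (above 0.9 in the present data).
```

A rule of this form predicts overall accuracy from the location statistics alone, which is consistent with its fitting proportion correct fairly well while failing to capture the detailed pattern of errors.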
ACKNOWLEDGMENTS
The authors are grateful to Kelly Egan for her assis-
tance. They would also like to thank Nathaniel I. Durlach
and Barbara Shinn-Cunningham for comments on an earlier
version of the manuscript. This work was supported by
Grants Nos. DC00100, DC04545, and DC04663 from NIH/
NIDCD and by AFOSR Award No. FA9550-05-1-2005. Fre-
derick Gallun was supported by F32 DC006526 from
NIDCD.
¹The use of the term “spatial dimension” in this article is confined to variations in sound source azimuth. A complete description of sound source location includes distance from the listener and elevation, which are not considered here.
²Preferential selection of objects or events on the right side has been observed in many tasks other than speech recognition (e.g., Nisbett and Wilson, 1977).
Arbogast, T. L., and Kidd, G., Jr. (2000). “Evidence for spatial tuning in informational masking using the probe-signal method,” J. Acoust. Soc. Am. 108, 1803–1810.
Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2002). “The effect of spatial separation on informational and energetic masking of speech,” J. Acoust. Soc. Am. 112, 2086–2098.
Asbjornsen, A. E., and Bryden, M. P. (1996). “Biased attention and the fused dichotic words test,” Neuropsychologia 34, 407–411.
Asbjornsen, A., and Hugdahl, K. (1995). “Attentional effects in dichotic listening,” Brain Lang. 49, 189–201.
Bolia, R. S., Nelson, T. W., and Morley, R. M. (2001). “Asymmetric performance in the cocktail party effect: Implications for the design of spatial audio displays,” Hum. Factors 43, 208–216.
Bolia, R. S., Nelson, W. T., Ericson, M. A., and Simpson, B. D. (2000). “A speech corpus for multitalker communications research,” J. Acoust. Soc. Am. 107, 1065–1066.
Bronkhorst, A. W. (2000). “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,” Acust. Acta Acust. 86, 117–128.
Bronkhorst, A. W., and Plomp, R. (1988). “The effect of head-induced interaural time and level differences on speech intelligibility in noise,” J. Acoust. Soc. Am. 83, 1508–1516.
Brungart, D. S., and Simpson, B. D. (2002). “Within-ear and across-ear interference in a cocktail-party listening task,” J. Acoust. Soc. Am. 112, 2985–2995.
Brungart, D. S., and Simpson, B. D. (2005). “Cocktail party listening in a dynamic multitalker environment,” (unpublished).
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). “Informational and energetic masking effects in the perception of multiple simultaneous talkers,” J. Acoust. Soc. Am. 110, 2527–2538.
Cherry, E. C. (1953). “Some experiments on the recognition of speech, with one and two ears,” J. Acoust. Soc. Am. 25, 975–979.
Colburn, H. S. (1996). “Computational models of binaural processing,” in Auditory Computation, edited by H. Hawkins, T. McMullin, A. N. Popper, and R. R. Fay (Springer-Verlag, New York), pp. 332–400.
Colburn, H. S., and Durlach, N. I. (1978). “Models of binaural interaction,” in Handbook of Perception, Vol. IV, Hearing, edited by E. C. Carterette and M. P. Friedman (Academic, New York).
Culling, J. F., Hawley, M. L., and Litovsky, R. Y. (2004). “The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources,” J. Acoust. Soc. Am. 116, 1057.
Durlach, N. I. (1972). “Binaural signal detection: Equalization and cancellation theory,” in Foundations of Modern Auditory Theory, Vol. II, edited by J. V. Tobias (Academic, New York).
Ebata, M. (2003). “Spatial unmasking and attention related to the cocktail party problem,” Acoust. Sci. Tech. 24, 208–219.
Ericson, M. A., Brungart, D. S., and Simpson, B. D. (2004). “Factors that influence intelligibility in multitalker speech displays,” J. Aviation Psych. 14, 311–332.
Gallun, F. J., Mason, C. R., and Kidd, G., Jr. (2005). “Task-dependent costs in processing two simultaneous auditory stimuli,” (unpublished).
Green, T. J., and McKeown, J. D. (2001). “Capture of attention in selective frequency listening,” J. Exp. Psychol. Hum. Percept. Perform. 27, 1197–1210.
Greenberg, G. S., and Larkin, W. D. (1968). “Frequency-response characteristic of auditory observers detecting signals of a single frequency in noise: The probe-signal method,” J. Acoust. Soc. Am. 44, 1513–1523.
Hawley, M. L., Litovsky, R. Y., and Culling, J. F. (2004). “The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer,” J. Acoust. Soc. Am. 115, 833–843.
Hill, N. I., Bailey, P. J., and Hodgson, P. (1998). “A probe-signal study of auditory discrimination of complex tones,” J. Acoust. Soc. Am. 102, 2291–2296.
Kidd, G., Jr., Mason, C. R., Brughera, A., and Hartmann, W. M. (2005). “The role of reverberation in release from masking due to spatial separation of sources for speech identification,” Acust. Acta Acust. 91, 526–536.
Kidd, G., Jr., Mason, C. R., Arbogast, T. L., Brungart, D., and Simpson, B. (2003). “Informational masking caused by contralateral stimulation,” J. Acoust. Soc. Am. 113, 1594–1603.
Kimura, D. (1967). “Functional asymmetry of the brain in dichotic listening,” Cortex 22, 163–178.
Lavie, N., Hirst, A., de Fockert, J. W., and Viding, E. (2004). “Load theory of selective attention and cognitive control,” J. Exp. Psychol. Gen. 133, 339–354.
Macmillan, N. A., and Schwartz, M. (1975). “A probe-signal investigation of uncertain-frequency detection,” J. Acoust. Soc. Am. 58, 1051–1058.
Mondor, T. A., and Zatorre, R. J. (1995). “Shifting and focusing auditory spatial attention,” J. Exp. Psychol. Hum. Percept. Perform. 21, 397–409.
Mondor, T. A., Zatorre, R. J., and Terrio, N. A. (1998). “Constraints on the selection of auditory information,” J. Exp. Psychol. Hum. Percept. Perform. 24, 66–79.
Nisbett, R. E., and Wilson, T. C. (1977). “Telling more than we know: Verbal reports on mental processes,” Psychol. Rev. 84, 231–259.
Plomp, R. (1976). “Binaural and monaural speech intelligibility of connected discourse in reverberation as a function of azimuth of a single
competing sound source (speech or noise),” Acustica 34, 200–211.
Sach, A. J., Hill, N. I., and Bailey, P. J. (2000). “Auditory spatial attention using interaural time differences,” J. Exp. Psychol. Hum. Percept. Perform. 26, 717–729.
Scharf, B. (1998). “Auditory attention: The psychoacoustical approach,” in Attention, edited by H. Pashler (Psychology Press, East Sussex, UK), pp. 75–117.
Scharf, B., Quigley, S., Aoki, C., Peachey, N., and Reeves, A. (1987). “Focused auditory attention and frequency selectivity,” Percept. Psychophys. 42, 215–223.
Schlauch, R. S., and Hafter, E. R. (1991). “Listening bandwidths and frequency uncertainty in pure-tone signal detection,” J. Acoust. Soc. Am. 90, 1332–1339.
Shanks, D. R., Tunney, R. J., and McCarthy, J. D. (2002). “A re-examination of probability matching and rational choice,” J. Behav. Dec. Making 15, 233–250.
Shinn-Cunningham, B. G., Schickler, J., Kopco, N., and Litovsky, R. Y. (2001). “Spatial unmasking of nearby speech sources in a simulated anechoic environment,” J. Acoust. Soc. Am. 110, 1118–1129.
Spence, C. J., and Driver, J. (1994). “Covert spatial orienting in audition: Exogenous and endogenous mechanisms,” J. Exp. Psychol. Hum. Percept. Perform. 20, 555–574.
Vulkan, N. (2000). “An economist’s perspective on probability matching,” J. Econ. Surveys 14, 101–118.
West, R. F., and Stanovich, K. E. (2003). “Is probability matching smart? Associations between probabilistic choices and cognitive ability,” Mem. Cognit. 31, 243–251.
Woods, D. L., Alain, C., Diaz, R., Rhodes, D., and Ogawa, K. H. (2001). “Location and frequency cues in auditory selective attention,” J. Exp. Psychol. Hum. Percept. Perform. 27, 65–74.
Wright, B. A., and Dai, H. (1994). “Detection of unexpected tones with short and long durations,” J. Acoust. Soc. Am. 95, 931–938.
Wright, B. A., and Dai, H. (1998). “Detection of sinusoidal amplitude modulation at unexpected rates,” J. Acoust. Soc. Am. 104, 2991–2997.
Yost, W. A. (1997). “The cocktail party problem: Forty years later,” in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. A. Gilkey and T. R. Anderson (Erlbaum, Hillsdale, NJ), pp. 329–348.
Yost, W. A., Dye, R. H., and Sheft, S. (1996). “A simulated ‘cocktail party’ with up to three sound sources,” Percept. Psychophys. 58, 1026–1036.
Zatorre, R. J. (1989). “Perceptual asymmetry on the dichotic fused words test and cerebral speech lateralization determined by the carotid sodium amytal test,” Neuropsychologia 27, 1207–1219.
Zurek, P. M. (1993). “Binaural advantages and directional effects in speech intelligibility,” in Acoustical Factors Affecting Hearing Aid Performance, edited by G. A. Studebaker and I. Hochberg (Allyn and Bacon, Boston), pp. 255–276.