The advantage of knowing where to listen a)
Gerald Kidd, Jr., b) Tanya L. Arbogast, Christine R. Mason, and Frederick J. Gallun
Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University,
635 Commonwealth Avenue, Boston, Massachusetts 02215
(Received 31 May 2005; revised 9 September 2005; accepted 12 September 2005)
This study examined the role of focused attention along the spatial (azimuthal) dimension in a
highly uncertain multitalker listening situation. The task of the listener was to identify key words
from a target talker in the presence of two other talkers simultaneously uttering similar sentences.
When the listener had no a priori knowledge about target location, or which of the three sentences
was the target sentence, performance was relatively poor, near the value expected simply from
choosing to focus attention on only one of the three locations. When the target sentence was cued
before the trial, but location was uncertain, performance improved significantly relative to the
uncued case. When spatial location information was provided before the trial, performance
improved significantly for both cued and uncued conditions. If the location of the target was certain,
proportion correct identification performance was higher than 0.9, independent of whether the target
was cued beforehand. In contrast to studies in which known versus unknown spatial locations were
compared for relatively simple stimuli and tasks, the results of the current experiments suggest that
the focus of attention along the spatial dimension can play a very significant role in solving the
"cocktail party" problem. © 2005 Acoustical Society of America. [DOI: 10.1121/1.2109187]
PACS number(s): 43.66.Lj, 43.66.Dc, 43.66.Pn [AJO] Pages: 3804–3815
I. INTRODUCTION
There are many factors that can interfere with a listener
attempting to comprehend the speech of one particular talker
in the presence of competing talkers. The speech of the target
talker can be masked by other sounds, obscuring portions of
the message, leaving it so incomplete as to be meaningless or
misunderstood. The target speech stream can be embedded in
other speech streams to the point where the listener is unable
to segregate it from the others and cannot connect the ele-
ments of the target message that belong together. The listener
can be uncertain or confused about which talker to attend to
and thereby direct attention to the wrong source of speech.
There are other acoustic and perceptual factors that come
into play as well. The target speech, competing speech, and
other sounds usually reflect off of the various surfaces of the
sound field, creating echoes that arrive at the ears delayed in
time and from various directions and that interact with the
direct sources of sound. Also, there is the normal, fundamen-
tal use of the sense of hearing to continually monitor the
auditory scene for important changes and rapidly evaluate
them when they occur, potentially interrupting and diverting
processing resources away from the target. Despite the
daunting complexity of this task, humans are normally quite
successful at selecting and comprehending the speech of one
talker among many talkers or other distracting and compet-
ing sources of sound. However, this complexity—comprised
of acoustic, perceptual, and cognitive factors—makes it ex-
tremely difficult to completely describe the processes in-
volved and how they interact. Despite the fact that this capability has been studied extensively since Cherry (1953)
published his famous article describing the "cocktail party
problem" (for recent reviews see Yost, 1997; Bronkhorst,
2000; and Ebata, 2003), a number of important questions
remain.
Among the questions about the cocktail party problem
for which we are lacking a satisfactory answer is that of the
importance of the ability to focus attention at a point along
the spatial dimension.1 Clearly, attention must be focused on
the target source of speech if it is to be fully understood, but
there are many ways to segregate the target speech stream
from other sounds and the importance of the focus of atten-
tion along the spatial dimension, per se, is not well under-
stood. Scharf 共1998兲, for example, in his review of attention
in the auditory modality, notes that most of the evidence
regarding the role of spatial focus of attention does not indi-
cate large effects. Generally, cuing uncertain locations results
in relatively small improvements in response time and, in
some cases, in accuracy (e.g., Spence and Driver, 1994;
Mondor and Zatorre, 1995; Mondor et al., 1998; Sach et al.,
2000; Woods et al., 2001). However, very little of this work
has used speech as the stimulus. The question addressed in
the current study is: if the speech of the target talker is not
appreciably masked by competing sounds, and the sounds
and their sources are easily segregated into distinct auditory
objects, what is the benefit of directing attention toward the
target source?
Determining the role of focused attention in the spatial
dimension is closely related to understanding how binaural
information is processed in the auditory system. There is an
extensive and compelling body of evidence in support of the
important role that binaural cues provide in hearing out a
target source among masking sources. However, binaural
a) Portions of this work were presented at the 149th Meeting of the Acoustical
Society of America in Vancouver, BC, Canada, May 2005.
b) Electronic mail: gkidd@bu.edu
3804 J. Acoust. Soc. Am. 118(6), December 2005 © 2005 Acoustical Society of America. 0001-4966/2005/118(6)/3804/12/$22.50
cues may be used in different ways at different stages of
processing in the auditory system to produce a selective lis-
tening advantage. The most extensively studied binaural cues
improve the effective target-to-masker ratio (T/M) of the input from the auditory periphery to higher neural levels.
These include the “better-ear advantage” in which the spatial
separation of target and masker improves the acoustical T/M
in one ear relative to the case in which target and masker
emanate from the same location. Spatial separation of
sources also causes interaural differences which are pro-
cessed by neurons in the binaural portions of the ascending
auditory pathway to improve the effective T/M of the stimulus (cf. Durlach, 1972; Colburn, 1996; Colburn and Durlach,
1978). Binaural interaction is usually thought of as occurring
automatically (i.e., not under voluntary control) according to
the stimulus-driven properties of these neurons. The maximum advantage of spatial separation of a speech target and a
speech-shaped noise masker in a sound field is about
8–10 dB (larger advantages may be obtained for sources
very near the listener, e.g., Shinn-Cunningham et al., 2001)
and is roughly equally attributable to contributions of the
better-ear advantage and binaural interaction (Zurek, 1993;
see also Plomp, 1976; Bronkhorst and Plomp, 1988; Culling
et al., 2004).
When a speech target is masked by a noise, better-ear
listening and binaural interaction may almost completely ac-
count for the advantage afforded the listener by spatial sepa-
ration of sound sources. However, when the target is the
speech of one talker and the masker is the speech of another
talker(s), the problem is more complex and other factors
must be considered. First of all, perceptual segregation of a
human voice from a Gaussian noise is a trivial problem—
they differ in nearly every important way that might cause
them to be erroneously grouped together. Normally, a listener
has little difficulty distinguishing which object is noise and
which is speech and focusing attention on one or the other is
a simple matter. When the masker is another speech source,
however, the segregation task may be simple, but then again
it may not be, depending on how different the two talkers are
with respect to segregation cues such as fundamental fre-
quency, intonation patterns, envelope coherence across fre-
quency, timbre, etc. In such cases, segregating and directing
attention to the correct source may be difficult indeed. Furthermore, similar voices are easily confused and lead to errors in speech recognition even for clearly segregated
sources, particularly when the listener is uncertain about
which source is the target (e.g., Brungart et al., 2001; Arbogast et al., 2002). In selective listening tasks involving multiple talkers, it is often unclear whether the interference observed in target speech recognition is a result of masking,
failure to segregate the target, confusion and misdirected at-
tention, or some combination of factors.
In attempting to determine the role played by selective
attention, manipulating the expectation of the observer is often key. Greenberg and Larkin (1968) demonstrated that listeners exhibit a high degree of selectivity in the frequency
domain using the "probe-signal" method, in which the signal
(target) frequency had a much higher likelihood of occurrence than several surrounding probe frequencies. Although
both target and probe tones were equally detectable when
presented alone, detectability was higher in the mixed case
for the more likely target tone than for the less likely probe
tones, with performance (as a function of frequency) resembling the attenuation characteristics of a bandpass filter.
Since the initial report by Greenberg and Larkin (1968), the
technique has been used by many other investigators to demonstrate attentional tuning in frequency (e.g., MacMillan and
Schwartz, 1975; Scharf et al., 1987; Schlauch and Hafter,
1991; Green and McKeown, 2001); time (Wright and Dai,
1994); spectral shape (Hill et al., 1998); and modulation frequency (Wright and Dai, 1998).
Arbogast and Kidd (2000) found evidence for "tuning"
in spatial azimuth for both accuracy and response time mea-
sures in a probe-signal frequency pattern identification task,
but the effects were relatively small and occurred when the
acoustic environment was very complex and uncertain. In
fact, most of the recent work on spatial attention has used
simple stimuli and tasks such as detection of the presence of
a tone in quiet or in noise and thus does not bear a close
correspondence to the complex multitalker problem posed
early on by Cherry (1953).
Erickson et al. (2004) compared speech identification
performance for conditions in which the location (simulated
under headphones using head-related transfer functions,
HRTFs) of a target talker was chosen at random from among
four possible locations to conditions in which the location of
the talker was held constant. They measured identification
performance for a target talker in the presence of one to three
other talkers uttering similarly constructed sentences. For a
known target talker (same voice across trials in a block of
trials), the performance advantage obtained by providing a
fixed location was significant when either two or three com-
peting talkers were present. The size of the advantage was
nearly 20 percentage points for the two-masker condition.
Recently, Brungart and Simpson (2005) extended these findings to conditions where target location changed probabilistically across trials within a run. As the probability of a location transition increased, speech identification performance
decreased.
The results of the Erickson et al. (2004) and Brungart
and Simpson (2005) studies suggest that attending to a par-
ticular location along the spatial dimension, at least when the
listener knows the talker and/or has a priori knowledge
about the target sentence, can provide a significant advantage
in recognizing the speech of a target talker in the presence of
competing talkers. This effect appears to be principally due
to directed attention rather than a “better ear” advantage or
binaural analysis. An important factor in the Erickson et al.
and Brungart and Simpson studies, as well as the Arbogast
and Kidd (2000) study mentioned earlier, was the presence
of a high degree of uncertainty. Perhaps the role of spatial
focus of attention is revealed more readily when the listening
task is very demanding and produces a heavy processing
load on the observer.
The present study is similar to that of Erickson et al.
(2004), discussed above, but also has some important methodological differences. First, a condition was tested in which
there was no a priori knowledge provided to the listener
(within the context of the range of uncertainty in the experiment) about the target. In that condition, the callsign that
identifies the target sentence was only provided after the
stimulus. This manipulation was intended to produce a very
high load on both the attention and the memory of the lis-
tener and is essentially a divided attention task. Second, un-
certainty about target location was varied probabilistically
over a range of values in order to produce a function relating
performance to degree of uncertainty.
II. METHODS
A. Listeners
The listeners were four normal-hearing college students
ranging in age from 19 to 22 years. Listeners were paid for
their participation.
B. Stimuli
The stimuli were sentences from the Coordinate Response Measure (CRM) corpus (Bolia et al., 2000). The four
male talkers from the corpus were used. Sentences have the format: "Ready
[callsign] go to [color] [number] now." For each talker, the
corpus contains sentences with all possible combinations of
eight callsigns, four colors, and eight numbers.
C. Procedures
The data were collected in a 12 × 13 ft sound field enclosed by a single-walled IAC booth. The walls and ceiling
were perforated metal panels and the floor was carpeted. The
acoustic characteristics of this room are described in Kidd et
al. (2005; room condition "BARE"). The stimuli were presented via three Acoustic Research 215PS loudspeakers located 5 ft from the listener and positioned at 0° and ±60°,
where 0° is directly in front of the listener and +60° is to the
listener’s right. The height of the loudspeakers was approxi-
mately the same as the height of the listener’s head when
seated. These loudspeakers were calibrated and matched in
terms of overall level at the location of the listener’s head.
Each sentence was played through a separate channel of
Tucker-Davis Technologies hardware. Sentences were con-
verted at a rate of 40 kHz by a 16-bit, eight-channel D/A
converter (DA8), low-pass filtered at 20 kHz (FT6), attenuated (PA4), and passed through power amplifiers (Tascam)
that were connected to the three loudspeakers.
On each trial, three sentences were presented simulta-
neously, one to each of the three loudspeakers. Each sentence
was played at 60 dB SPL. The three sentences were ran-
domly chosen on each trial with the requirement that the
talkers, callsigns, colors, and numbers of the three sentences
were all mutually exclusive. One sentence of the three was
randomly designated as the target sentence by providing the
listener with its callsign while the other two were considered
maskers. The listener’s task was to identify the color and
number from the target sentence in a 4 × 8-alternative
forced-choice procedure. A handheld keypad/LCD display
(Q-term) was used to relay messages to the listener in the
booth and to register the listener’s responses. A warning on
the Q-term display preceded each trial. Data were collected
in blocks of 30 trials each. At the end of each block, percent-correct feedback was provided for that block. Listeners participated in the experiment in sessions of 1.5 to 2 h each,
including several breaks. The listeners' heads were not restrained, but they were instructed to face directly ahead (0°
azimuth) during stimulus presentation.
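The trial construction described in the preceding sections (three simultaneous CRM sentences whose talkers, callsigns, colors, and numbers are all mutually exclusive, with one sentence designated the target) can be sketched as follows. This is a minimal illustration, not the authors' code; the talker and callsign indices are placeholders, and the specific color names are an assumption about the CRM corpus.

```python
import random

COLORS = ["blue", "green", "red", "white"]  # assumed CRM color set
NUMBERS = list(range(1, 9))                 # the eight CRM numbers
TALKERS = list(range(4))                    # indices for the four male talkers
CALLSIGNS = list(range(8))                  # indices for the eight callsigns

def make_trial(rng=None):
    """Draw three sentences whose talkers, callsigns, colors, and
    numbers are all mutually exclusive, then designate one at random
    as the target (identified to the listener by its callsign)."""
    rng = rng or random.Random()
    talkers = rng.sample(TALKERS, 3)
    callsigns = rng.sample(CALLSIGNS, 3)
    colors = rng.sample(COLORS, 3)
    numbers = rng.sample(NUMBERS, 3)
    sentences = [
        {"talker": t, "callsign": c, "color": col, "number": n}
        for t, c, col, n in zip(talkers, callsigns, colors, numbers)
    ]
    target = rng.randrange(3)  # which of the three sentences is the target
    return sentences, target
```

Because `random.sample` draws without replacement within a trial, no two of the three sentences can share a talker, callsign, color, or number, matching the mutual-exclusivity constraint described above.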
There were two main variables in the experiment. First,
the callsign indicating the target sentence could be provided
to the listener (by visual display on the Q-term) either a
minimum of 1 s before ("callsign before") or immediately
after ("callsign after") stimulus presentation. In both cases
the callsign display remained on the screen until after the
listener’s response was recorded and response feedback was
provided. Second, the a priori probabilities associated with
target occurrence at each location were varied. When one
loudspeaker was more likely to be the source of the target the
probabilities tested were 1, 0.8, and 0.6 and the probabilities
assigned to both of the other two loudspeakers were 0, 0.1,
and 0.2, respectively. Each callsign by probability condition
was tested separately for each of the three locations. There
was also a condition in which the target source was equally
likely among all three locations (i.e., p = 1/3); this condition is referred to
as "random." The probabilities assigned to the three locations
were held constant across a block of 30 trials. The listener
was reminded of the probabilities associated with each loca-
tion at the start of every trial. The warning message that
preceded each stimulus presentation indicated the expected
percentage of trials for which the signal sentence would be
presented from each location. For example, “80-10-10” indi-
cated that the target sentence would be expected to be played
from −60° approximately 80% of the time and from the 0°
and +60° locations approximately 10% of the time each for
that block of trials. The sampling that determined target lo-
cation on any given trial was with replacement so that the
actual frequency of occurrence varied.
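The per-trial sampling just described (target location drawn with replacement from the block's a priori probabilities) can be sketched as below; this is an illustrative reconstruction, not the experiment's actual software.

```python
import random

def sample_block_locations(probs, n_trials=30, seed=None):
    """Sample a target location for each trial of a block, with
    replacement, from the a priori probabilities assigned to the
    -60, 0, and +60 degree loudspeakers."""
    rng = random.Random(seed)
    return rng.choices([-60, 0, 60], weights=probs, k=n_trials)

# Example: the "80-10-10" block described in the text.
block = sample_block_locations([0.8, 0.1, 0.1], n_trials=30)
# Because sampling is with replacement, the realized counts vary
# from block to block around the expected 24/3/3 split.
```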
The combination of these variables yielded 24 conditions (2 callsign conditions × 3 locations × 4 probabilities). Data were
collected in two-block pairs of the same condition. After every pair of blocks, the callsign condition (callsign before or
callsign after) was changed, with the initial callsign condition
chosen randomly for each listener. For every two blocks of
data for a given callsign before or callsign after condition,
the probability/location condition was chosen randomly
without replacement from among the 12 possible conditions
available (3 locations × 4 probabilities). A minimum of 16
blocks (480 trials) was collected for each callsign, location,
and probability condition. In the random probability condition, because there was no location subcondition, three times
as many blocks were collected, for a minimum of 48 blocks
(1440 trials), the same number as in the other conditions
when summed across location. Listeners were minimally
trained in the task with a single block of 30 trials in which
the target sentence was played alone at 0° azimuth.
III. RESULTS
A. Accuracy
Performance was specified as proportion correct identi-
fication where a response was counted correct only if both
the color and number of the target sentence were identified.
Chance performance was about 0.03 (1/32; 4 colors by 8
numbers). The four listeners were very similar in their performance; therefore, the results are displayed as group means
and standard deviations. Figure 1 shows proportion correct
identification (symbols) as a function of the probability of
occurrence of the target at a specific location for the callsign
before (circles) and callsign after (triangles) conditions. For
both callsign before and callsign after, performance declined
as target location uncertainty increased. For the callsign after
condition, performance decreased from a proportion correct
of about 0.91 when the target location was certain (i.e., p = 1) to about 0.31 when the target location was randomly
chosen among the three locations. For callsign before, proportion correct identification performance was about the
same as for callsign after when the location was certain
(0.92) and decreased to about 0.67 when the location of the
target was chosen at random. The dashed line at the bottom
indicates chance performance; the other two lines with no
symbols (dotted and dash-dot) will be discussed later.
In order to determine whether the trends apparent in Fig.
1 were statistically significant, the data were transformed
into arcsine units and then submitted to a repeated-measures
ANOVA with three within-subjects factors: callsign, location
(−60°, 0°, +60°), and probability of occurrence at a given
location. All three main factors were significant: callsign
[F(1,3) = 59.3, p < 0.01], location [F(2,6) = 75.5, p < 0.001],
and probability [F(3,9) = 998.1, p < 0.001]. In addition, the
interaction of callsign and probability was significant
[F(3,9) = 95.5, p < 0.001], as were the interactions of location
and probability [F(6,18) = 9.95, p < 0.001] and callsign and
location [F(2,6) = 6.2, p < 0.05]. The three-way interaction
was not significant [p > 0.05].
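The text states only that proportions were transformed to arcsine units before the ANOVA, without giving the exact form. A common variance-stabilizing choice for proportions, shown here purely as an assumption, is 2·arcsin(√p):

```python
import math

def arcsine_units(p):
    """Variance-stabilizing arcsine transform for a proportion p in
    [0, 1]. The paper does not specify its exact form; this is the
    standard 2*arcsin(sqrt(p)) convention."""
    return 2.0 * math.asin(math.sqrt(p))

# e.g., transform condition means before submitting them to an ANOVA
scores = [0.92, 0.67, 0.31]
transformed = [arcsine_units(p) for p in scores]
```

The transform spreads out proportions near 0 and 1, where raw-proportion variances are compressed, which is why it is routinely applied before ANOVA on accuracy data.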
Knowing the callsign in advance potentially allowed the
listener to identify the target voice early in the stimulus and
either find and focus on the location of the target talker or
simply follow the voice of the target talker until the test
items occurred, or both. However, it appears that knowing
the callsign beforehand without knowledge about location
was less useful as a cue than the converse. The proportion
correct identification for the condition in which the callsign was
certain (callsign before) paired with uncertain location (p = 0.33, random) was on average about 0.67, whereas the certain-location condition (p = 1) paired with uncertain callsign
(callsign after) yielded about 0.91. When p = 1 and the callsign
was given in advance, no additional improvement was noted
(0.92). The callsign-before random-location condition resulted in roughly the same performance as the callsign-after
p = 0.8 condition.
Not only are the main effects of callsign before versus
after and of probability of occurrence apparent in Fig. 1, but the
significant interaction between the two factors is also obvious. There was essentially no difference between callsign
after and callsign before at p = 1, but the difference between
the two increased systematically as uncertainty increased until, in the random condition, the proportion correct in the
callsign before condition was about 0.36 higher than in the
callsign after condition. The main effect of location noted
above was significant because, overall, presenting the target
from the location of +60° (to the right) resulted in a higher
proportion correct (0.76) than for the locations of −60°
(0.65) or 0° (0.66). However, as the significant interaction
between location and probability of occurrence indicates, the
effect of location also depended on uncertainty and was
mainly due to the random condition. This effect may be seen
in Fig. 2, which displays proportion correct identification as
a function of the target location uncertainty. The circles are
FIG. 1. Group mean proportion correct identification scores, and standard
deviations of the means, plotted as a function of increasing uncertainty
about target location (a priori probability of occurrence). The circles represent performance in the "callsign before" condition and triangles indicate
performance in the "callsign after" condition (see text). The dashed line at
the bottom represents chance performance. The lines near the data points
indicate the performance predicted by a simple single-source listener strategy for the callsign before (dash-dot) and callsign after (dot) conditions.
FIG. 2. Group mean proportion correct identification scores, and standard deviations of the means, subdivided according to the location from which the target
was presented.
for the callsign before condition and the triangles are for the
callsign after condition. Each panel is for a different location.
In the p = 1 case, no additional data sorting was required.
However, for the p < 1 cases the data were sorted according
to actual location (that is, the proportion correct is for all
trials in which the target was presented from that particular
location, as opposed to proportion correct for all trials in the
nominal location condition). The error bars are ±1 standard
deviation of the mean. The effect of location increased with
increasing location uncertainty, with the greatest difference in
performance, as noted above, in the two random conditions,
where proportion correct identification was about 0.3 higher
for the target at +60° than at −60°.
It is also of interest to determine how well listeners per-
formed when the target actually occurred at the expected
location versus when it occurred at an unexpected location.
This can be examined in the two intermediate probability
conditions. It might be expected, for example, that the listen-
ers always focused attention at the more likely location. In
that case, performance should equal that found in the p = 1
condition (i.e., about 0.92 correct) on the trials when the
target was presented from that location. Figure 3 displays
proportion correct performance as a function of expected and
unexpected target locations. Expected location means that
the target sentence was played at the most likely location,
while unexpected location means that the target sentence was
played at one of the other two less likely locations. Circles
are for the callsign before condition and triangles are for the
callsign after condition. Unfilled symbols indicate target pre-
sentation at the expected location and filled symbols are for
target presentation at unexpected locations. Data are means
and standard deviations computed across listeners for all locations. The results from p = 1 and p = 0.33 are included here
for comparison (same as in Fig. 1), although there is no "unexpected" case for p = 1 or "expected" case for p = 0.33. The
horizontal lines will be discussed later.
When the target was presented at the expected location,
proportion correct identification performance was 0.8 or
greater in all conditions. In that case, the callsign condition
did not matter and the degree of uncertainty had a relatively
minor effect, decreasing from around 0.92 for p = 1 to around
0.80 for p = 0.6. The decline in observed performance with
decreasing p undoubtedly reflects a cost associated with the
greater uncertainty about location and could indicate some
attempt by the listeners to increasingly divide or distribute
attention among locations. However, the small effect sug-
gests that this had only a minor influence on performance for
trials at the expected location.
In contrast to the results obtained when the target was
presented at the expected location, target presentation from
an unexpected location led to much poorer performance with
large differences observed between callsign before and call-
sign after conditions. For callsign before, performance actually improved as uncertainty increased, from a proportion correct of about 0.43 for p = 0.8 to about 0.67 for p = 0.33. The
improvement in performance with increasing target-location
uncertainty is a reasonable outcome if one assumes that there
is a substantial penalty associated with attending to the
wrong location.
Perhaps the most striking result shown in Fig. 3 is that,
for the callsign after condition and p = 0.8 or 0.6, listeners
were almost never correct (0.02 and 0.05, respectively) when
the target occurred at an unexpected location. Even the
knowledge that the target would occur in the more likely
location only 60% of the time still did not improve perfor-
mance for the unexpected locations. Therefore, for callsign
after, the listeners appeared to focus attention almost entirely
at the expected location.
For the callsign before p = 0.8 and 0.6 conditions, listeners were correct nearly half the time (0.43 and 0.53, respectively) when the location was unexpected. This implies that
listeners used a combination of expected location and target
callsign to perform the task. They probably were not using
target callsign alone because performance for expected loca-
tions was significantly better than for unexpected locations.
Obviously, they did not use location alone either because
they were correct for unexpected locations fairly often.
B. Predictions of a simple single-source listener
strategy
There are a number of strategies potentially available to
the listener in attempting to solve this task. It is not possible
from the data obtained in the current experiments, however,
to evaluate and decide among all of the alternative strategies
or observer models. On the other hand, it can be very useful
and informative to take a single, simple strategy and follow
through its assumptions and predictions. In this section, we
consider the predictions of one such strategy and make com-
parisons to the results described above.
There are three initial assumptions that define this lis-
tener strategy. First, it is assumed that the sources are per-
fectly segregated and that errors occur because attention is
directed to the incorrect source. It is already known, how-
ever, from the results discussed above, that performance was
not perfect even for the p = 1 case. Accordingly, the predictions of listener performance that follow are scaled by a multiplier of 0.92 (the observed proportion correct in the certain
FIG. 3. Group mean proportion correct identification scores (ordinate), and
standard deviations of the means, subdivided according to whether the target
was presented at the expected (more likely; open symbols) or an unexpected
(less likely; filled symbols) location. The abscissa is target location uncertainty. The circles represent performance for the callsign before condition
while the triangles represent performance for the callsign after condition.
The horizontal lines not connecting data points are the predictions of the
single-source listener strategy discussed in the text.
location case) to reflect some (unspecified) limitations on the
ability to identify targets at attended locations. Second, it is
assumed that the listener attends to only one source at any
given moment in time. While this is a strong assumption that
excludes models based on divided attention, it also is
straightforward to evaluate and provides a means for deter-
mining whether interpretations based on divided attention are
necessary. The results described above for targets occurring
at unexpected locations for the callsign after conditions pro-
vide some degree of support for this assumption. Third, it is
assumed that the listener always attends to the more likely
location. In the case of callsign before presentation, this pro-
vides an opportunity for the listener to switch attention to
another location if it is determined that that location does not
contain the target. This would happen whenever a nontarget
callsign is presented from the location initially attended, i.e.,
the expected location. In those instances, it is assumed that
the listener randomly chooses to focus on one of the other
two sources because at the time switching occurs the other
callsigns would have already been presented and are as-
sumed not to be useful to the listener.
In determining the performance of this hypothetical lis-
tener, certain conditional probabilities may be defined. First,
for callsign after, performance should simply be the probability of occurrence scaled by the p = 1 value of 0.92. So, the predicted performance for callsign after is

PCa = p × PCmax,   (1)

where PCa is the predicted proportion correct for the callsign after condition, p is the a priori probability of occurrence at one location, and PCmax is the highest proportion correct possible for an attended location (based on the p = 1 results). The predictions of listener performance from this equation are shown as the dotted line in Fig. 1.
The predictions for the callsign before condition include PCa as a term, but also include a term representing the increase in performance expected by switching attention after determining that the attended location is not the target location. In that case

PCb = PCa + 0.5[(1 − p) PCmax],   (2)

where PCb is the predicted performance for the callsign before condition. The predictions for performance based on this equation are also shown in Fig. 1 as a dash-dotted line. As a first approximation, this simple strategy accounts for the group-mean accuracy results fairly well (comparison of group mean data and lines in Fig. 1).
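As a check, Eqs. (1) and (2) can be evaluated directly. The following is a minimal sketch (not part of the original study) using only the quantities quoted in the text, with PCmax = 0.92 taken from the p = 1 results:

```python
# Single-source listener strategy predictions, using PCmax = 0.92
# (the identification score measured when target location was certain).
PC_MAX = 0.92

def pc_after(p, pc_max=PC_MAX):
    """Eq. (1): callsign after -- attend only the most likely location."""
    return p * pc_max

def pc_before(p, pc_max=PC_MAX):
    """Eq. (2): callsign before -- switch to one of the two remaining
    sources when a nontarget callsign is heard at the attended location."""
    return pc_after(p, pc_max) + 0.5 * (1 - p) * pc_max

for p in (1.0, 0.8, 0.6, 1 / 3):  # p = 1/3 corresponds to the random condition
    print(f"p = {p:.2f}: after = {pc_after(p):.2f}, before = {pc_before(p):.2f}")
```

For the random condition (p = 1/3) this gives about 0.31 for callsign after, matching the value discussed in the text.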
It is also possible to compare the predictions associated
with this strategy to the results shown in Fig. 3, where the
listener responses are computed according to target presen-
tation at expected and unexpected locations for the target
probabilities of 0.8 and 0.6. The predictions of this listener
strategy are straightforward. Based on the assumption that
the listener attends to the most likely location, performance
should equal PCmax whenever the target is actually presented
from that location for both the callsign before and callsign
after conditions. This prediction is shown on Fig. 3 as horizontal lines (dash-dotted and dotted for the two callsign conditions), both at a proportion correct value of 0.92. As noted
earlier, performance is slightly below the prediction, possibly
reflecting a cost associated with target location uncertainty.
When the target occurs at unexpected locations, the predic-
tions of this listener strategy differ markedly for callsign af-
ter and callsign before conditions. For callsign after, the only
information available to the listener is from the masker at the
expected location, so optimal performance would be to
choose a color/number combination from among the remain-
ing alternatives after eliminating the ones known to be inac-
curate. Thus, performance should be at around 0.05 propor-
tion correct 共3 colors ⫻7 numbers, with a small correction
for guessing on 1-PCmax proportion of the trials, shown by
the horizontal dotted line兲. In fact, the observed performance
is close to that prediction. In the random condition, we as-
sume the listener simply attends to one arbitrarily chosen
location and thus should obtain a proportion correct of p times PCmax, or about 0.31, which, as noted above in Fig. 1, is quite close to the data point.
For the callsign before conditions, once the masker call-
sign is heard from the expected location, the hypothetical
listener switches the focus of attention to one of the other
locations. The choice would be arbitrary because it is as-
sumed that only the callsign from the attended—and
incorrect—location was processed, so the listener would
have a 0.5 probability of selecting the correct source. Thus,
performance should be equal to PCmax times 0.5, or about
0.46 proportion correct (shown as the dash-dotted line at that
value兲. Inspection of Fig. 3 suggests that, again, this listener
strategy predicts performance reasonably well with the ob-
tained performance for callsign before at unexpected loca-
tions near the predicted values.
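The two unexpected-location predictions can also be computed directly. In the sketch below (an illustration, not the authors' code), the small guessing correction for callsign after is approximated as chance over all 32 color/number alternatives on the 1 − PCmax proportion of trials; the exact form of that correction is an assumption, since the text does not spell it out:

```python
# Predictions when the target occurs at an UNEXPECTED location.
PC_MAX = 0.92

# Callsign after: eliminate the attended masker's color and number,
# then guess among the 3 x 7 = 21 remaining combinations; on the
# remaining 1 - PCmax of trials, assume chance over all 32 alternatives
# (this correction term is an assumption, not specified in the text).
pc_after_unexpected = PC_MAX * (1 / 21) + (1 - PC_MAX) * (1 / 32)
print(f"callsign after, unexpected location:  {pc_after_unexpected:.3f}")

# Callsign before: switch away from the expected location and pick one
# of the two remaining sources at random.
pc_before_unexpected = 0.5 * PC_MAX
print(f"callsign before, unexpected location: {pc_before_unexpected:.2f}")
```

The first value comes out near the 0.05 quoted in the text, and the second is exactly the 0.46 quoted for the switching strategy.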
Overall, this simple listener strategy was fairly success-
ful at predicting the proportion correct obtained from actual
listeners in most conditions. However, there were some ef-
fects that it cannot capture, such as the difference between
locations revealed in Fig. 2. Also, the results in Fig. 3 indi-
cate small but understandable differences between the 0.8
and 0.6 probability conditions that are not accounted for in
this simplified strategy. The results from the 0.6 condition
suggest a slightly greater tendency to distribute attention
with lower scores at the expected locations and higher scores
at unexpected locations.
C. Error analysis
It is also informative to examine the errors in identifica-
tion made by listeners as uncertainty varied. This analysis is
of interest both for attempting to understand actual listener
performance and for evaluating the listener strategy consid-
ered above which makes strong predictions about the types
of errors that should occur. Four broad categories of possible errors should be considered. First, the listener could
confuse the target and a masker so that the color and number
that are reported correspond to those of one of the two
maskers. That type of error could be considered misdirection
of attention due to focusing on the wrong source. This would
form the great majority of errors predicted by the listener
strategy discussed above. A second type of error would be
due to random guessing from among the alternatives or per-
haps guessing after eliminating one color and one number as
hypothesized above for some callsign after conditions. Errors
from random guessing might occur if performance were lim-
ited by energetic masking, where the test words (colors and numbers) were obscured and guessing was the only option
available. Third, the errors could take the form of mixing
colors and numbers from among the three sources, either one
word from the target and one from a masker or one each
from the two different maskers. This type of error might
occur due to a breakdown in stream segregation. That is, the
words are available but the listener is unable to properly
connect the callsign, color, and number. And, finally, the er-
rors could contain words not presented during the trial. This
type of error could result in a mixture of words including one
target or masker word and one word not presented on the
trial (i.e., guessed from among the other alternatives). This
type of error is not as easily interpreted as the other three
types because there is a fairly high chance that it could occur
due to random guessing (9 of the 32 alternatives are presented on each trial). The frequency of occurrence of each of
these error types can be determined by analyzing the errors
found in the experiment.
Figure 4 shows the proportions of incorrect responses
obtained in the experiment subdivided according to the three
main types of errors that were made. These data are plotted
as a function of location uncertainty, and, because the pat-
terns of errors were very similar across listeners, are shown
as group means. The different types of errors are indicated by
the black, gray, and white portions of the stacked bars. First,
the overall height of the bars represents the proportion of all
errors in which both of the responses (color and number)
matched any of the six key words spoken on a given trial.
For p = 1, this accounted for a proportion of the errors equal to 0.87, while for the other values of p this accounted for proportions between 0.93 and 0.97 regardless of callsign
condition. What this means is that there was very little guess-
ing that occurred in any of these conditions—the key words
were confused or mixed together but they appear to have
been available to the listener. By way of comparison, the
expected proportion of errors due to random guessing in
which both responses were from the words spoken on a
given trial is 0.26 (8 of the 9 possible combinations of words presented on a trial are errors, out of 31 total possible error combinations). The obtained proportion of errors of this type overall was 0.94 (the same as the average of the total heights of bars in Fig. 4). These findings, along with the p = 1 results, suggest that very little energetic masking was present in any of these conditions and support the conclusion that the second type of error discussed above (random guessing) had a negligible effect on performance.
The different shadings indicate a finer-grain analysis of
these errors. The black lower portions of the bars represent
the proportion of errors in which one of the words reported
was from the target sentence and the other word was from
one of the two masker sentences. This type of error was by
far the most common for p = 1 (keeping in mind that there were very few errors overall in this condition) but was less
frequent for the other values of p. The occurrence of this
error would be consistent with a breakdown in the process of
speech stream segregation: the target and masker sentences
were not held separate, but were mixed. To the extent that
stream segregation involves perceptually connecting a se-
quence of sound elements that belong together, these “mix-
ing” errors reveal a failure of that process and dominated the
p = 1 condition (0.75 and 0.72 proportion of errors for callsign before and after, respectively). It also accounted for substantial proportions of errors (ranging from about 0.18 to 0.38) in the more uncertain location conditions. For these p < 1 conditions, the proportion of errors of this type was somewhat higher for callsign before than callsign after.
The intermediate gray bars indicate the proportions of
errors in which both key words reported corresponded to the
key words from one of the two masker sentences. This was
the first category of error discussed above. When the listener
was certain about where to listen (p = 1), confusions with masker talkers rarely occurred (0.10 to 0.14 proportion of errors). For the conditions containing location uncertainty, however, masker confusions formed the most common type of error (with overall proportions of errors ranging from 0.50 to 0.69, and higher proportions for callsign after than for callsign before). This could occur if the spatial focus of attention was directed to the wrong location and the listener
reported the color and number from that location. Finally, the
white bars at the top indicate the proportions of errors when
one word was from one masker and the other word was from
the other masker. That type of error was infrequent in all
conditions, did not differ with callsign knowledge, and was
least frequent for the p = 1 case.
The remaining errors not plotted in Fig. 4 are cases in
which at least one word reported did not correspond to any
of the words spoken (one target word and one unspoken word; one masker word and one unspoken word; both words unspoken). The probability from random guessing for one word spoken and one word unspoken is 0.58 (18 of 31), whereas the obtained proportion of errors of that type was 0.055. The expected proportion of errors due to guessing when both words reported were unspoken would be 0.16 (5 of 31), but the obtained proportion was only 0.005.
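The chance proportions quoted here and earlier (0.26, 0.58, and 0.16) follow from counting over the CRM response set. The sketch below assumes 4 colors and 8 numbers (consistent with the 32 alternatives mentioned earlier) and that the three talkers use distinct colors and distinct numbers:

```python
# Chance proportions for the error categories, counted over the CRM
# response set: 4 colors x 8 numbers = 32 color/number combinations,
# of which 3 x 3 = 9 use only words spoken on a trial. One combination
# is correct, leaving 31 possible error responses.
total_errors = 32 - 1                      # 31 error combinations
both_spoken = 3 * 3 - 1                    # 8: both words spoken, not the target pair
one_unspoken = 3 * (8 - 3) + (4 - 3) * 3   # 18: exactly one reported word unspoken
both_unspoken = (4 - 3) * (8 - 3)          # 5: neither reported word was spoken
assert both_spoken + one_unspoken + both_unspoken == total_errors

for name, n in [("both words spoken", both_spoken),
                ("one word unspoken", one_unspoken),
                ("both words unspoken", both_unspoken)]:
    print(f"{name}: {n}/{total_errors} = {n / total_errors:.2f}")
```

This reproduces the 8/31 = 0.26, 18/31 = 0.58, and 5/31 = 0.16 figures used in the error analysis.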
FIG. 4. This bar graph indicates the proportions of various types of errors in
the speech identification task. The ordinate is the proportion of errors and
the abscissa is the degree of target location uncertainty. The different shad-
ings designate the types of errors (see legend and text).
As discussed above, the most common type of error for
the p⬍1 conditions was a masker confusion error. To what
extent does this error reflect focusing attention at the wrong
location? An analysis of errors that is relevant to this issue
involves determining the extent to which incorrect responses
in uncertain location conditions corresponded to the masker
sentence that was presented at the expected target location.
That is, the listener knew that the target was more likely at
one particular location and incorrectly reported the (masker)
color and number from that location. This analysis is only
possible for the 0.8 and 0.6 probability conditions. Figure 5
shows the proportion of incorrect responses that matched the
color and number from the masker sentence presented at the
expected location (when the target was actually presented at an unexpected location). The data are intersubject means and standard deviations (combined over the three locations). The
black bars are for callsign before and the white bars are for
the callsign after condition. In the callsign before condition,
the proportion of errors corresponding to the masker sen-
tence presented at the expected target location was between
0.4 and 0.5. For the callsign after condition, the correspond-
ing proportions of errors were much higher (0.74 and 0.81).
These data, along with those presented in Fig. 4, are
inconsistent with the simple listener strategy described in
Sec. III B above. Although that strategy was generally suc-
cessful in accounting for the group-mean accuracy results in
the various conditions, the patterns of errors obtained were
not consistent with that strategy. The listener strategy exam-
ined above predicts that all of the errors (except perhaps those errors equal to 1 − PCmax) should be confusions with one
of the two maskers. That prediction was not supported by the
results. The proportions of errors for p < 1 cases that were not from one specific masker ranged from about 0.3–0.5 and increased to almost 0.9 when p = 1. Thus, a substantial pro-
portion of errors occurred that could not be attributed to con-
fusing a masker with the target or simply reporting the sen-
tence from the wrong location. Furthermore, the callsign
before results shown in Fig. 5 are incompatible with the pre-
dictions of a switching strategy because nearly one-half of
the errors corresponded to reporting a masker presented at
the expected location. While these proportions are smaller
than those seen in the callsign after condition in which this
strategy is not possible, the switching strategy predicts that
none of the errors should correspond to that location. This
suggests a fairly strong tendency to report the content from
the expected location and not shift attention away from that
location during stimulus presentation. For callsign before,
that result is surprising because the listener should realize
that the expected location does not contain the target as soon
as the inaccurate callsign is heard. We do not have a good
explanation for this nonoptimal listener behavior and can
only speculate that the results reflect a strong bias in favor of
a priori information regarding location. The tendency to rely
on expected target location is even more strongly apparent
for the callsign after condition in which the proportions of
errors from the expected location were greater than 0.74. In
that case, if it were assumed that only the callsign, color, and
number from one location—the attended location—were re-
membered, then a more effective strategy would be simply to
guess from among the color and number alternatives after
excluding those from that masker. In that case, again, none of
the errors would correspond to the most likely target loca-
tion. The results in Fig. 5 reveal that listeners tended not to
adopt that strategy.
IV. DISCUSSION
The first point to be made concerns the degree to which
the findings here answer the question posed in the Introduc-
tion regarding the role of spatial focus of attention in solving
the cocktail party problem. The results of the current study
support the conclusion that focusing attention toward a target
sound source in the presence of spatially distributed maskers
can provide a very significant advantage in speech identifi-
cation performance. However, this prominent role of spatial
focus of attention may depend on other factors such as the
complexity of the listening environment and the processing
load placed on the observer.
In order to conclude that the listening advantages found
in this study may be attributed to spatial focus of attention,
we must consider the possible role of other factors such as
masking and perceptual segregation of sounds. In contrast to
studies in which the spatial separation between sources was
varied in an attempt to reveal listening advantages due to
binaural interaction and better ear listening, the spatial loca-
tion of the sources in the present study did not vary across
conditions. Thus, the binaural cues of interaural time and
level differences allowed the listener to locate the sources
within the environment, but did not differ across test condi-
tions and whatever amount of masking was present did not
influence the main variables of interest. The only case where
acoustic differences may have played a role is in the comparison of performance across fixed locations. For example,
an acoustically “better ear” would occur when the target was
presented through either the left or right loudspeaker com-
pared to the center loudspeaker. As shown in Fig. 2, some
differences in performance were observed favoring the right
location in the random condition. However, this is not likely
to be an indication of binaural masking release because it
was not present consistently across conditions nor was it
FIG. 5. The proportions of incorrect responses that matched the color and number of the masker sentence presented at the expected (more likely) location when the target was presented at an unexpected (less likely) location. The values plotted are group means and standard deviations for the callsign before (black bars) and callsign after (white bars) conditions.
present for left location presentation, which would be ex-
pected to produce roughly the same binaural cues as the
symmetrically placed right location.
It is not clear from Fig. 2 whether the difference in per-
formance across loudspeaker locations reflects a genuine per-
ceptual effect or a bias on the part of the listeners to attend to
the location on the right.2 There is a long history of study of
“right-ear advantages” in dichotic listening and, although the
effect has been attributed primarily to differences in the pro-
cessing of speech in the two hemispheres of the brain (e.g., Kimura, 1967; Zatorre, 1989), there is also evidence that the effect is labile and susceptible to attentional focus (e.g., Asbjornsen and Hugdahl, 1995; Asbjornsen and Bryden, 1996).
A further analysis of the data was undertaken to help under-
stand this effect in the current experiment. The results of this
analysis are shown in Fig. 6. This figure shows the propor-
tions of all words reported from the more likely location
regardless of whether the words were from targets (lower black portions of bars) or from maskers (upper white portions of bars). Note that for the random condition, there was
no “more likely location” so the bars represent all responses
to words from each location.
Inspection of Fig. 6 indicates that a strong asymmetry
between right- and left-side locations is only clearly evident
in the random condition. In that condition, not only was there
a tendency to report target words more often when they were
presented from the right side, but there was also a tendency
to report masker words more often when presented from the
right side (primarily for the callsign after condition, but in both callsign conditions both target and masker words were reported less often when located to the left). Thus, the words
presented from the right loudspeaker were chosen more often
than those from (in particular) the left loudspeaker regardless
of whether they were correct or incorrect. This response pat-
tern clearly reflects bias on the part of the listeners because
the expected probability of occurrence from the three loca-
tions in the random condition was equal. This bias largely
disappears when uncertainty about target location is de-
creased. Whether there is also a genuine processing differ-
ence, and whether that is in some way related to the observed
bias, cannot be ascertained from these data. Another relevant
point is that, because the stimuli were presented via loud-
speakers, the speech from all three talkers was present in
both ears on every trial. Thus, the differences in performance
must be related to the differences in acoustics as target posi-
tion varied. The acoustic differences between target and
masker talkers, though, are much less than in dichotic listen-
ing tasks where the stimuli are presented separately to the
two ears. It is not clear how the effect found here is related to
these acoustic differences. It should be noted that both Bolia
et al. (2001) and Brungart and Simpson (2005) have reported
right hemifield identification performance advantages using
the CRM test for earphone-based stimuli processed by
HRTFs.
The analysis presented in Fig. 6 also provides insight
into another issue. There is an extensive literature concerning
the tendency of subjects to adjust their response strategy to
match the a priori probabilities of different response alterna-
tives being correct, or having different payoffs, even if that
strategy is nonoptimal (called "probability matching"; e.g., Shanks et al., 2002; West and Stanovich, 2003; also, review by Vulkan, 2000). In the current experiments, evidence for
probability matching might be found by comparing the pro-
portions of responses to stimuli from expected locations to
the a priori probabilities of target presentation from those
locations. For callsign before, there is indeed a reasonably
close correspondence between overall response rates and the
assigned probabilities (Fig. 6). However, because the callsign
was provided in addition to location probabilities, it cannot
be determined if this correspondence truly reflects matching
responses to probabilities of occurrence at the different loca-
tions or is a by-product of a combined callsign-location
weighting strategy. Furthermore, the opportunity for switch-
ing the attended location based on callsign also could pro-
duce the proportions of responses shown in Fig. 6. For call-
sign after for p = 0.8 and, especially, p = 0.6, little support for
a probability matching interpretation is apparent. The ob-
tained proportions of responses from the expected locations
for those two conditions are greater than 0.9 and 0.8, respec-
tively, which are substantially above the corresponding prob-
abilities of occurrence.
The finding that identification performance was so accu-
rate when p = 1 provides support for the conclusion that, gen-
erally, masking was minimal and the sound sources could be
FIG. 6. The proportion of all words (single words or pairs) reported from the more likely location. This includes both words from the target (correct responses—lower black portion of bars) and from a masker (incorrect responses—upper white portion of bars). In each panel both callsign before and after responses are shown at each loudspeaker location. In the random condition (right-most panel) these are just proportions of responses by location because there was no "more likely" location.
successfully segregated. This conclusion is based on the as-
sumption that both factors are necessary for successful iden-
tification to occur. However, errors attributed to a breakdown
in the perceptual segregation of the speech streams were
present in all conditions. Although these “mixing errors”
were the most common type of error for the certain-location
conditions, the errors were still very infrequent with propor-
tion correct scores over 0.9 found for both callsign condi-
tions. The success of the listeners in those conditions indi-
cates that the speech streams can be segregated, but the
maintenance of separate streams is vulnerable to increasing
uncertainty about target location. Here, performance suffered with increased uncertainty as attention was more often focused at the wrong location, and errors consistent both with misdirected attention and with loss of stream segregation were found.
Furthermore, because identification performance was not
perfect even when the most a priori information was pro-
vided to the listener (about 0.92 proportion correct for the callsign before, p = 1 condition), some interference caused by nontarget sound sources clearly was present (proportion correct identification performance for a single-talker baseline condition was nearly perfect). Thus, the two more common
types of errors were confusions between target and masker,
which we believe indicate misdirection of attention, and mix-
ing target and masker words, which probably indicates a
breakdown in maintaining speech stream segregation.
The present results also imply that surprisingly little
shifting or redirecting of attention occurred during trials.
When the callsign was provided before trials, but the target
occurred at an unexpected location (p = 0.6 and 0.8), nearly
one-half of the errors corresponded to the masker at the tar-
get’s most likely location. These conditions indicate a rivalry
of cues: target talker versus location. In this case, directing
attention to the wrong location clearly diminished the ability
of the listener to follow the speech of the target talker from
the callsign to the test words. These scores are worse, in fact,
than in the random case where proportion correct perfor-
mance was about 0.7. Thus, when the target talker was indi-
cated beforehand there was a penalty of directing attention to
the wrong location.
Comparison of performance in the callsign before and
callsign after conditions overall indicates the interaction be-
tween the processing load imposed by the task and the im-
portance of spatial focus of attention. Ideally, if an observer
were capable of remembering all three simultaneous sen-
tences, specifically, associating and remembering the three
key words from each source—the callsign, color, and num-
ber, then spatial information would be unnecessary and per-
formance in both callsign before and callsign after conditions
would be equivalent. Clearly, this was not possible when
there were three simultaneous talkers. According to the “load
theory of attention" proposed by Lavie et al. (2004), when
observers are faced with a very demanding perceptual task—
segregating one element of an array of similar elements, for
example—the nontarget elements produce little interference
in the selection and processing of the target. This is because
of the assumption that all available perceptual resources are
occupied at any given point in time. If the resources required
to process the target occupy the entire pool of resources
available, none are left to allocate to distracting sources, so
little or no interference occurs. However, if the perceptual
task is not demanding but the subsequent cognitive control
load is, as would be the case if a high load were placed on
working memory, much greater interference from distracters
is predicted. In the current conditions, despite the fact that
the three sources are male talkers, the segregation task is
relatively easy. This conclusion is based on the high identi-
fication scores in the certain-location conditions. However,
particularly in the callsign after case, the cognitive load—the demands on working memory that would be sufficient to solve the task—is very high. Furthermore, when uncer-
tainty about location increased, errors in perceptual
segregation—mixing errors—increased as well (note that here we are referring to the combination of number of errors as in Fig. 1 and the error type as in Fig. 4). If we assume that
increasing uncertainty mainly affects cognitive load, then in-
creasing errors from distracters—either misdirected attention
or loss of stream segregation—would be expected. Thus, al-
though the evidence in support of the load theory usually
takes the form of response times (as do most data concerned with auditory attention, with some exceptions, cf. Sach et al., 2000; Erickson et al., 2004), our results appear to be quali-
tatively consistent with that theory.
We do not have data for other numbers of talkers (except for the control case of one talker) but speculate that equally
strong spatial effects could be observed for four or more
talkers. This is because the three-talker condition has already
degraded performance to near the limiting case imposed by
the number of potential target sources. Assuming that each
source may be segregated from the others and that only one
message can be remembered, then performance in the call-
sign after-random condition should at best be equal to the
reciprocal of the number of sources, if each is equally likely
to be the target. That is, the listener chooses one to attend to
and does so perfectly and completely to the exclusion of the
others. For callsign after, performance in the random condi-
tion was near that which would be expected based on this
strategy. An interesting case, then, is for two simultaneous
sources. At present, we have some indications that perfor-
mance in the two-talker callsign after condition may exceed
the reciprocal of the number-of-sources limit (Gallun et al., 2005).
Recently, the two- vs three-source comparison for mul-
titalker speech identification has been considered by Brun-
gart and Simpson (2002). In their study, the task was the
same as here—reporting the color and number for a desig-
nated callsign in the CRM task—and there were either one or
two other talkers also uttering sentences from the CRM task.
However, the stimuli were presented via earphones. When
the target, which was presented to the “ipsilateral” ear, was
mixed with one masker in the same ear, performance varied
predictably according to the target-to-masker ratio (T/M). When one masker was pre-
sented to the contralateral ear, no interference in target iden-
tification was observed; i.e., identification performance ap-
proached 100% correct. When both maskers—one ipsilateral
and one contralateral—were presented, target identification
performance was as much as 40 percentage points poorer
than the single (ipsilateral) masker condition. Brungart and
Simpson (2002) attributed this large decline in performance
to the much greater difficulty in ignoring two sources, rather
than one source, when there was a difficult segregation task
to perform in the target ear. Kidd et al. (2003) found a par-
allel effect using complex nonspeech stimuli. The greater
difficulty in three-talker conditions than in two-talker condi-
tions has been noted by Yost et al. (1996) and Hawley et al. (2004). In the current study, we speculate that the high pro-
cessing load caused by three simultaneous talkers contrib-
uted to the large beneficial effect of a priori information,
especially that of the most likely target location.
V. SUMMARY AND CONCLUSIONS
A priori knowledge about where to direct attention provided significant advantages in speech identification in highly uncertain multitalker listening conditions. This result was found for both cued and uncued target sentence presentation. The differences in performance found across the various conditions could not be attributed to binaural analysis, masking, or perceptual segregation, but rather appear to be uniquely related to focus of attention. The pattern of errors suggested a high degree of spatial selectivity with very little processing of speech originating from unattended locations. A simple single-source listener strategy was found to predict accuracy fairly well in most conditions but failed to account for the observed patterns of errors. The current results support the view that spatial focus of attention can be a very important factor in complex and uncertain multisource listening environments and may play a crucial role in solving the “cocktail party problem.”
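
The single-source listener strategy referred to above can be sketched as a weighted sum of two performance levels (a simplified illustration only; the symbols p, P_att, and P_chance are introduced here and do not appear in the article):

```latex
% Sketch: predicted proportion correct for a listener who attends
% exactly one source location per trial. The attended location
% contains the target with probability p; identification then
% succeeds with probability P_att, and otherwise only at roughly
% the chance rate P_chance for the response set.
\[
  P_c \approx p\,P_{\mathrm{att}} + (1 - p)\,P_{\mathrm{chance}}
\]
% With three equally likely target locations and no a priori
% information, p = 1/3; with certain target location, p = 1 and
% P_c approaches P_att (above 0.9 in the present data).
```

A rule of this form predicts overall accuracy from the location statistics alone, which is consistent with its fitting proportion correct fairly well while failing to capture the detailed pattern of errors.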
ACKNOWLEDGMENTS
The authors are grateful to Kelly Egan for her assis-
tance. They would also like to thank Nathaniel I. Durlach
and Barbara Shinn-Cunningham for comments on an earlier
version of the manuscript. This work was supported by
Grants Nos. DC00100, DC04545, and DC04663 from NIH/
NIDCD and by AFOSR Award No. FA9550-05-1-2005. Fre-
derick Gallun was supported by F32 DC006526 from
NIDCD.
¹The use of the term “spatial dimension” in this article is confined to variations in sound source azimuth. A complete description of sound source location includes distance from the listener and elevation, which are not considered here.
²Preferential selection of objects or events on the right side has been observed in many tasks other than speech recognition (e.g., Nisbett and Wilson, 1977).
Arbogast, T. L., and Kidd, G., Jr. (2000). “Evidence for spatial tuning in informational masking using the probe-signal method,” J. Acoust. Soc. Am. 108, 1803–1810.
Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2002). “The effect of spatial separation on informational and energetic masking of speech,” J. Acoust. Soc. Am. 112, 2086–2098.
Asbjornsen, A. E., and Bryden, M. P. (1996). “Biased attention and the fused dichotic words test,” Neuropsychologia 34, 407–411.
Asbjornsen, A., and Hugdahl, K. (1995). “Attentional effects in dichotic listening,” Brain Lang. 49, 189–201.
Bolia, R. S., Nelson, T. W., and Morley, R. M. (2001). “Asymmetric performance in the cocktail party effect: Implications for the design of spatial audio displays,” Hum. Factors 43, 208–216.
Bolia, R. S., Nelson, W. T., Ericson, M. A., and Simpson, B. D. (2000). “A speech corpus for multitalker communications research,” J. Acoust. Soc. Am. 107, 1065–1066.
Bronkhorst, A. W. (2000). “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,” Acust. Acta Acust. 86, 117–128.
Bronkhorst, A. W., and Plomp, R. (1988). “The effect of head-induced interaural time and level differences on speech intelligibility in noise,” J. Acoust. Soc. Am. 83, 1508–1516.
Brungart, D. S., and Simpson, B. D. (2002). “Within-ear and across-ear interference in a cocktail-party listening task,” J. Acoust. Soc. Am. 112, 2985–2995.
Brungart, D. S., and Simpson, B. D. (2005). “Cocktail party listening in a dynamic multitalker environment,” (unpublished).
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). “Informational and energetic masking effects in the perception of multiple simultaneous talkers,” J. Acoust. Soc. Am. 110, 2527–2538.
Cherry, E. C. (1953). “Some experiments on the recognition of speech, with one and two ears,” J. Acoust. Soc. Am. 25, 975–979.
Colburn, H. S. (1996). “Computational models of binaural processing,” in Auditory Computation, edited by H. Hawkins, T. McMullin, A. N. Popper, and R. R. Fay (Springer-Verlag, New York), pp. 332–400.
Colburn, H. S., and Durlach, N. I. (1978). “Models of binaural interaction,” in Handbook of Perception, Vol. IV, Hearing, edited by E. C. Carterette and M. P. Friedman (Academic, New York).
Culling, J. F., Hawley, M. L., and Litovsky, R. Y. (2004). “The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources,” J. Acoust. Soc. Am. 116, 1057.
Durlach, N. I. (1972). “Binaural signal detection: Equalization and cancellation theory,” in Foundations of Modern Auditory Theory, Vol. II, edited by J. V. Tobias (Academic, New York).
Ebata, M. (2003). “Spatial unmasking and attention related to the cocktail party problem,” Acoust. Sci. Tech. 24, 208–219.
Ericson, M. A., Brungart, D. S., and Simpson, B. D. (2004). “Factors that influence intelligibility in multitalker speech displays,” J. Aviation Psych. 14, 311–332.
Gallun, F. J., Mason, C. R., and Kidd, G., Jr. (2005). “Task-dependent costs in processing two simultaneous auditory stimuli,” (unpublished).
Green, T. J., and McKeown, J. D. (2001). “Capture of attention in selective frequency listening,” J. Exp. Psychol. Hum. Percept. Perform. 27, 1197–1210.
Greenberg, G. S., and Larkin, W. D. (1968). “Frequency-response characteristic of auditory observers detecting signals of a single frequency in noise: The probe-signal method,” J. Acoust. Soc. Am. 44, 1513–1523.
Hawley, M. L., Litovsky, R. Y., and Culling, J. F. (2004). “The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer,” J. Acoust. Soc. Am. 115, 833–843.
Hill, N. I., Bailey, P. J., and Hodgson, P. (1998). “A probe-signal study of auditory discrimination of complex tones,” J. Acoust. Soc. Am. 102, 2291–2296.
Kidd, G., Jr., Mason, C. R., Brughera, A., and Hartmann, W. M. (2005). “The role of reverberation in release from masking due to spatial separation of sources for speech identification,” Acust. Acta Acust. 91, 526–536.
Kidd, G., Jr., Mason, C. R., Arbogast, T. L., Brungart, D., and Simpson, B. (2003). “Informational masking caused by contralateral stimulation,” J. Acoust. Soc. Am. 113, 1594–1603.
Kimura, D. (1967). “Functional asymmetry of the brain in dichotic listening,” Cortex 22, 163–178.
Lavie, N., Hirst, A., de Fockert, J. W., and Viding, E. (2004). “Load theory of selective attention and cognitive control,” J. Exp. Psychol. Gen. 133, 339–354.
Macmillan, N. A., and Schwartz, M. (1975). “A probe-signal investigation of uncertain-frequency detection,” J. Acoust. Soc. Am. 58, 1051–1058.
Mondor, T. A., and Zatorre, R. J. (1995). “Shifting and focusing auditory spatial attention,” J. Exp. Psychol. Hum. Percept. Perform. 21, 397–409.
Mondor, T. A., Zatorre, R. J., and Terrio, N. A. (1998). “Constraints on the selection of auditory information,” J. Exp. Psychol. Hum. Percept. Perform. 24, 66–79.
Nisbett, R. E., and Wilson, T. C. (1977). “Telling more than we know: Verbal reports on mental processes,” Psychol. Rev. 84, 231–259.
Plomp, R. (1976). “Binaural and monaural speech intelligibility of connected discourse in reverberation as a function of azimuth of a single
competing sound source (speech or noise),” Acustica 34, 200–211.
Sach, A. J., Hill, N. I., and Bailey, P. J. (2000). “Auditory spatial attention using interaural time differences,” J. Exp. Psychol. Hum. Percept. Perform. 26, 717–729.
Scharf, B. (1998). “Auditory attention: The psychoacoustical approach,” in Attention, edited by H. Pashler (Psychology Press, East Sussex, UK), pp. 75–117.
Scharf, B., Quigley, S., Aoki, C., Peachey, N., and Reeves, A. (1987). “Focused auditory attention and frequency selectivity,” Percept. Psychophys. 42, 215–223.
Schlauch, R. S., and Hafter, E. R. (1991). “Listening bandwidths and frequency uncertainty in pure-tone signal detection,” J. Acoust. Soc. Am. 90, 1332–1339.
Shanks, D. R., Tunney, R. J., and McCarthy, J. D. (2002). “A re-examination of probability matching and rational choice,” J. Behav. Dec. Making 15, 233–250.
Shinn-Cunningham, B. G., Schickler, J., Kopco, N., and Litovsky, R. Y. (2001). “Spatial unmasking of nearby speech sources in a simulated anechoic environment,” J. Acoust. Soc. Am. 110, 1118–1129.
Spence, C. J., and Driver, J. (1994). “Covert spatial orienting in audition: Exogenous and endogenous mechanisms,” J. Exp. Psychol. Hum. Percept. Perform. 20, 555–574.
Vulkan, N. (2000). “An economist’s perspective on probability matching,” J. Econ. Surveys 14, 101–118.
West, R. F., and Stanovich, K. E. (2003). “Is probability matching smart? Associations between probabilistic choices and cognitive ability,” Mem. Cognit. 31, 243–251.
Woods, D. L., Alain, C., Diaz, R., Rhodes, D., and Ogawa, K. H. (2001). “Location and frequency cues in auditory selective attention,” J. Exp. Psychol. Hum. Percept. Perform. 27, 65–74.
Wright, B. A., and Dai, H. (1994). “Detection of unexpected tones with short and long durations,” J. Acoust. Soc. Am. 95, 931–938.
Wright, B. A., and Dai, H. (1998). “Detection of sinusoidal amplitude modulation at unexpected rates,” J. Acoust. Soc. Am. 104, 2991–2997.
Yost, W. A. (1997). “The cocktail party problem: Forty years later,” in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. A. Gilkey and T. R. Anderson (Erlbaum, Hillsdale, NJ), pp. 329–348.
Yost, W. A., Dye, R. H., and Sheft, S. (1996). “A simulated ‘cocktail party’ with up to three sound sources,” Percept. Psychophys. 58, 1026–1036.
Zatorre, R. J. (1989). “Perceptual asymmetry on the dichotic fused words test and cerebral speech lateralization determined by the carotid sodium amytal test,” Neuropsychologia 27, 1207–1219.
Zurek, P. M. (1993). “Binaural advantages and directional effects in speech intelligibility,” in Acoustical Factors Affecting Hearing Aid Performance, edited by G. A. Studebaker and I. Hochberg (Allyn and Bacon, Boston), pp. 255–276.