The advantage of knowing where to listena)
Gerald Kidd, Jr.,b) Tanya L. Arbogast, Christine R. Mason, and Frederick J. Gallun
Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University,
635 Commonwealth Avenue, Boston, Massachusetts 02215
Received 31 May 2005; revised 9 September 2005; accepted 12 September 2005
This study examined the role of focused attention along the spatial (azimuthal) dimension in a
highly uncertain multitalker listening situation. The task of the listener was to identify key words
from a target talker in the presence of two other talkers simultaneously uttering similar sentences.
When the listener had no a priori knowledge about target location, or which of the three sentences
was the target sentence, performance was relatively poor—near the value expected simply from
choosing to focus attention on only one of the three locations. When the target sentence was cued
before the trial, but location was uncertain, performance improved significantly relative to the
uncued case. When spatial location information was provided before the trial, performance
improved significantly for both cued and uncued conditions. If the location of the target was certain,
proportion correct identification performance was higher than 0.9 independent of whether the target
was cued beforehand. In contrast to studies in which known versus unknown spatial locations were
compared for relatively simple stimuli and tasks, the results of the current experiments suggest that
the focus of attention along the spatial dimension can play a very significant role in solving the
“cocktail party” problem. © 2005 Acoustical Society of America. DOI: 10.1121/1.2109187
PACS numbers: 43.66.Lj, 43.66.Dc, 43.66.Pn [AJO] Pages: 3804–3815
I. INTRODUCTION
There are many factors that can interfere with a listener
attempting to comprehend the speech of one particular talker
in the presence of competing talkers. The speech of the target
talker can be masked by other sounds, obscuring portions of
the message, leaving it so incomplete as to be meaningless or
misunderstood. The target speech stream can be embedded in
other speech streams to the point where the listener is unable
to segregate it from the others and cannot connect the ele-
ments of the target message that belong together. The listener
can be uncertain or confused about which talker to attend to
and thereby direct attention to the wrong source of speech.
There are other acoustic and perceptual factors that come
into play as well. The target speech, competing speech, and
other sounds usually reflect off of the various surfaces of the
sound field, creating echoes that arrive at the ears delayed in
time and from various directions and that interact with the
direct sources of sound. Also, there is the normal, fundamen-
tal use of the sense of hearing to continually monitor the
auditory scene for important changes and rapidly evaluate
them when they occur, potentially interrupting and diverting
processing resources away from the target. Despite the
daunting complexity of this task, humans are normally quite
successful at selecting and comprehending the speech of one
talker among many talkers or other distracting and compet-
ing sources of sound. However, this complexity—comprised
of acoustic, perceptual, and cognitive factors—makes it ex-
tremely difficult to completely describe the processes in-
volved and how they interact. Despite the fact that this ca-
pability has been studied extensively since Cherry (1953)
published his famous article describing the “cocktail party
problem” (for recent reviews see Yost, 1997; Bronkhorst,
2000; and Ebata, 2003), a number of important questions
remain.
Among the questions about the cocktail party problem
for which we are lacking a satisfactory answer is that of the
importance of the ability to focus attention at a point along
the spatial dimension.1 Clearly, attention must be focused on
the target source of speech if it is to be fully understood, but
there are many ways to segregate the target speech stream
from other sounds and the importance of the focus of atten-
tion along the spatial dimension, per se, is not well under-
stood. Scharf (1998), for example, in his review of attention
in the auditory modality, notes that most of the evidence
regarding the role of spatial focus of attention does not indi-
cate large effects. Generally, cuing uncertain locations results
in relatively small improvements in response time and, in
some cases, in accuracy (e.g., Spence and Driver, 1994;
Mondor and Zatorre, 1995; Mondor et al., 1998; Sach et al.,
2000; Woods et al., 2001). However, very little of this work
has used speech as the stimulus. The question addressed in
the current study is: if the speech of the target talker is not
appreciably masked by competing sounds, and the sounds
and their sources are easily segregated into distinct auditory
objects, what is the benefit of directing attention toward the
target source?
Determining the role of focused attention in the spatial
dimension is closely related to understanding how binaural
information is processed in the auditory system. There is an
extensive and compelling body of evidence in support of the
important role that binaural cues provide in hearing out a
target source among masking sources. However, binaural
a)Portions of this work were presented at the 149th meeting of the Acoustical
Society of America in Vancouver, BC, Canada, May 2005.
b)Electronic mail: gkidd@bu.edu
3804 J. Acoust. Soc. Am. 118 (6), December 2005 © 2005 Acoustical Society of America
cues may be used in different ways at different stages of
processing in the auditory system to produce a selective lis-
tening advantage. The most extensively studied binaural cues
improve the effective target-to-masker ratio (T/M) of the in-
put from the auditory periphery to higher neural levels.
These include the “better-ear advantage” in which the spatial
separation of target and masker improves the acoustical T/M
in one ear relative to the case in which target and masker
emanate from the same location. Spatial separation of
sources also causes interaural differences which are pro-
cessed by neurons in the binaural portions of the ascending
auditory pathway to improve the effective T/M of the stimu-
lus (cf. Durlach, 1972; Colburn, 1996; Colburn and Durlach,
1978). Binaural interaction is usually thought of as occurring
automatically (i.e., not under voluntary control) according to
the stimulus-driven properties of these neurons. The maxi-
mum advantage of spatial separation of a speech target and a
speech-shaped noise masker in a sound field is about
8–10 dB (larger advantages may be obtained for sources
very near the listener, e.g., Shinn-Cunningham et al., 2001)
and is roughly equally attributable to contributions of the
better-ear advantage and binaural interaction (Zurek, 1993;
see also Plomp, 1976; Bronkhorst and Plomp, 1988; Culling
et al., 2004).
When a speech target is masked by a noise, better-ear
listening and binaural interaction may almost completely ac-
count for the advantage afforded the listener by spatial sepa-
ration of sound sources. However, when the target is the
speech of one talker and the masker is the speech of another
talker (or talkers), the problem is more complex and other factors
must be considered. First of all, perceptual segregation of a
human voice from a Gaussian noise is a trivial problem—
they differ in nearly every important way that might cause
them to be erroneously grouped together. Normally, a listener
has little difficulty distinguishing which object is noise and
which is speech, and focusing attention on one or the other is
a simple matter. When the masker is another speech source,
however, the segregation task may be simple, but then again
it may not be, depending on how different the two talkers are
with respect to segregation cues such as fundamental fre-
quency, intonation patterns, envelope coherence across fre-
quency, timbre, etc. In such cases, segregating and directing
attention to the correct source may be difficult indeed. Fur-
thermore, similar voices are easily confused and lead to er-
rors in speech recognition even for clearly segregated
sources, particularly when the listener is uncertain about
which source is the target (e.g., Brungart et al., 2001; Arbo-
gast et al., 2002). In selective listening tasks involving mul-
tiple talkers, it is often unclear whether the interference ob-
served in target speech recognition is a result of masking,
failure to segregate the target, confusion and misdirected at-
tention, or some combination of factors.
In attempting to determine the role played by selective
attention, manipulating the expectation of the observer is of-
ten key. Greenberg and Larkin (1968) demonstrated that lis-
teners exhibit a high degree of selectivity in the frequency
domain using the “probe-signal” method in which the signal
(target) frequency had a much higher likelihood of occur-
rence than several surrounding probe frequencies. Although
both target and probe tones were equally detectable when
presented alone, detectability was higher in the mixed case
for the more likely target tone than for the less likely probe
tones, with performance (as a function of frequency) resem-
bling the attenuation characteristics of a bandpass filter.
Since the initial report by Greenberg and Larkin (1968), the
technique has been used by many other investigators to dem-
onstrate attentional tuning in frequency (e.g., MacMillan and
Schwartz, 1975; Scharf et al., 1987; Schlauch and Hafter,
1991; Green and McKeown, 2001), time (Wright and Dai,
1994), spectral shape (Hill et al., 1998), and modulation fre-
quency (Wright and Dai, 1998).
Arbogast and Kidd (2000) found evidence for “tuning”
in spatial azimuth for both accuracy and response time mea-
sures in a probe-signal frequency pattern identification task,
but the effects were relatively small and occurred when the
acoustic environment was very complex and uncertain. In
fact, most of the recent work on spatial attention has used
simple stimuli and tasks such as detection of the presence of
a tone in quiet or in noise and thus does not bear a close
correspondence to the complex multitalker problem posed
early on by Cherry (1953).
Erickson et al. (2004) compared speech identification
performance for conditions in which the location (simulated
under headphones using head-related transfer functions,
HRTFs) of a target talker was chosen at random from among
four possible locations to conditions in which the location of
the talker was held constant. They measured identification
performance for a target talker in the presence of one to three
other talkers uttering similarly constructed sentences. For a
known target talker (same voice across trials in a block of
trials), the performance advantage obtained by providing a
fixed location was significant when either two or three com-
peting talkers were present. The size of the advantage was
nearly 20 percentage points for the two-masker condition.
Recently, Brungart and Simpson (2005) extended these find-
ings to conditions where target location changed probabilis-
tically across trials within a run. As the probability of a lo-
cation transition increased, speech identification performance
decreased.
The results of the Erickson et al. (2004) and Brungart
and Simpson (2005) studies suggest that attending to a par-
ticular location along the spatial dimension, at least when the
listener knows the talker and/or has a priori knowledge
about the target sentence, can provide a significant advantage
in recognizing the speech of a target talker in the presence of
competing talkers. This effect appears to be principally due
to directed attention rather than a “better ear” advantage or
binaural analysis. An important factor in the Erickson et al.
and Brungart and Simpson studies, as well as the Arbogast
and Kidd (2000) study mentioned earlier, was the presence
of a high degree of uncertainty. Perhaps the role of spatial
focus of attention is revealed more readily when the listening
task is very demanding and produces a heavy processing
load on the observer.
The present study is similar to that discussed above by
Erickson et al. (2004) but also has some important method-
ological differences. First, a condition was tested in which
there was no a priori knowledge provided to the listener
(within the context of the range of uncertainty in the experi-
ment) about the target. In that condition, the callsign that
identifies the target sentence was only provided after the
stimulus. This manipulation was intended to produce a very
high load on both the attention and the memory of the lis-
tener and is essentially a divided attention task. Second, un-
certainty about target location was varied probabilistically
over a range of values in order to produce a function relating
performance to degree of uncertainty.
II. METHODS
A. Listeners
The listeners were four normal-hearing college students
ranging in age from 19 to 22 years. Listeners were paid for
their participation.
B. Stimuli
The stimuli were sentences from the Coordinate Re-
sponse Measure (CRM) corpus (Bolia et al., 2000). The four
male talkers were used. Sentences have the format: “Ready
[callsign] go to [color] [number] now.” For each talker, the
corpus contains sentences with all possible combinations of
eight callsigns, four colors, and eight numbers.
C. Procedures
The data were collected in a 12 × 13 ft sound field en-
closed by a single-walled IAC booth. The walls and ceiling
were perforated metal panels and the floor was carpeted. The
acoustic characteristics of this room are described in Kidd et
al. (2005; room condition “BARE”). The stimuli were pre-
sented via three Acoustic Research 215PS loudspeakers lo-
cated 5 ft from the listener and positioned at 0° and ±60°,
where 0° is directly in front of the listener and +60° is to the
listener’s right. The height of the loudspeakers was approxi-
mately the same as the height of the listener’s head when
seated. These loudspeakers were calibrated and matched in
terms of overall level at the location of the listener’s head.
Each sentence was played through a separate channel of
Tucker-Davis Technologies hardware. Sentences were con-
verted at a rate of 40 kHz by a 16-bit, eight-channel D/A
converter (DA8), low-pass filtered at 20 kHz (FT6), attenu-
ated (PA4), and passed through power amplifiers (Tascam)
that were connected to the three loudspeakers.
On each trial, three sentences were presented simulta-
neously, one to each of the three loudspeakers. Each sentence
was played at 60 dB SPL. The three sentences were ran-
domly chosen on each trial with the requirement that the
talkers, callsigns, colors, and numbers of the three sentences
were all mutually exclusive. One sentence of the three was
randomly designated as the target sentence by providing the
listener with its callsign while the other two were considered
maskers. The listener’s task was to identify the color and
number from the target sentence in a 4 × 8-alternative
forced-choice procedure. A handheld keypad/LCD display
(Q-term) was used to relay messages to the listener in the
booth and to register the listener’s responses. A warning on
the Q-term display preceded each trial. Data were collected
in blocks of 30 trials each. At the end of each block percent
correct feedback was provided for that block. Listeners par-
ticipated in the experiment in sessions of 1.5 to 2 h each,
including several breaks. The listeners’ heads were not re-
strained, but they were instructed to face directly ahead
(0° azimuth) during stimulus presentation.
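The trial-generation constraints above (three simultaneous CRM sentences whose talkers, callsigns, colors, and numbers are all mutually exclusive, with one sentence randomly designated the target) can be sketched as follows. The corpus dimensions are taken from the text; the function name, record fields, and the specific callsign/color labels are illustrative rather than drawn from the paper.

```python
import random

# Corpus dimensions from the text: 4 male talkers, 8 callsigns,
# 4 colors, 8 numbers (CRM corpus; labels here are illustrative).
TALKERS = [0, 1, 2, 3]
CALLSIGNS = ["arrow", "baron", "charlie", "eagle",
             "hopper", "laker", "ringo", "tiger"]
COLORS = ["blue", "green", "red", "white"]
NUMBERS = [1, 2, 3, 4, 5, 6, 7, 8]

def draw_trial(rng=random):
    """Draw three simultaneous sentences with mutually exclusive
    talkers, callsigns, colors, and numbers; one is the target."""
    # sample() draws without replacement, enforcing mutual exclusivity
    talkers = rng.sample(TALKERS, 3)
    callsigns = rng.sample(CALLSIGNS, 3)
    colors = rng.sample(COLORS, 3)
    numbers = rng.sample(NUMBERS, 3)
    sentences = [
        {"talker": t, "callsign": cs, "color": c, "number": n}
        for t, cs, c, n in zip(talkers, callsigns, colors, numbers)
    ]
    target = rng.randrange(3)  # one sentence randomly designated target
    return sentences, target
```

A listener response on such a trial would then be scored correct only if both the color and the number of the target sentence are reported, as described in Sec. III A.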
There were two main variables in the experiment. First,
the callsign indicating the target sentence could be provided
to the listener by visual display (on the Q-term) either a
minimum of 1 s before (“callsign before”) or immediately
after (“callsign after”) stimulus presentation. In both cases
the callsign display remained on the screen until after the
listener’s response was recorded and response feedback was
provided. Second, the a priori probabilities associated with
target occurrence at each location were varied. When one
loudspeaker was more likely to be the source of the target, the
probabilities tested were 1, 0.8, and 0.6, and the probabilities
assigned to each of the other two loudspeakers were 0, 0.1,
and 0.2, respectively. Each callsign-by-probability condition
was tested separately for each of the three locations. There
was also a condition in which the target source was equally
likely among all three locations (i.e., p = 1/3) that is referred to
as “random.” The probabilities assigned the three locations
were held constant across a block of 30 trials. The listener
was reminded of the probabilities associated with each loca-
tion at the start of every trial. The warning message that
preceded each stimulus presentation indicated the expected
percentage of trials for which the signal sentence would be
presented from each location. For example, “80-10-10” indi-
cated that the target sentence would be expected to be played
from −60° approximately 80% of the time and from the 0°
and +60° locations approximately 10% of the time each for
that block of trials. The sampling that determined target lo-
cation on any given trial was with replacement so that the
actual frequency of occurrence varied.
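The with-replacement sampling of target location described above can be sketched as a weighted draw on each trial; the probability sets are those given in the text (1/0/0, 0.8/0.1/0.1, 0.6/0.2/0.2, and 1/3 each), while the function name and dictionary layout are illustrative.

```python
import random

def draw_target_location(probs, rng=random):
    """Sample the target loudspeaker for one trial.

    `probs` maps the three azimuths (degrees) to their a priori
    probabilities for the block, e.g. {-60: 0.8, 0: 0.1, 60: 0.1}.
    Each trial is drawn independently (sampling with replacement),
    so the realized frequencies vary around the nominal values.
    """
    locations = list(probs)
    weights = [probs[loc] for loc in locations]
    return rng.choices(locations, weights=weights, k=1)[0]
```

Over a 30-trial block with an 80-10-10 assignment, for example, the target would land at the likely loudspeaker on roughly, but not exactly, 24 trials.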
The combination of these variables yielded 24 condi-
tions (2 callsigns × 3 locations × 4 probabilities). Data were
collected in two-block pairs of the same condition. After ev-
ery pair of blocks the callsign condition (callsign before or
callsign after) was changed, with the initial callsign condition
chosen randomly for each listener. For every two blocks of
data for a given callsign before or callsign after condition,
the probability/location condition was chosen randomly
without replacement from among the 12 possible conditions
available (3 locations × 4 probabilities). A minimum of 16
blocks (480 trials) were collected for each callsign, location,
and probability condition. In the random probability condi-
tion, because there was no location subcondition, three times
as many blocks were collected for a minimum of 48 blocks
(1440 trials), or the same number as in the other conditions
when summed across location. Listeners were minimally
trained in the task with a single block of 30 trials in which
the target sentence was played alone at 0° azimuth.
III. RESULTS
A. Accuracy
Performance was specified as proportion correct identi-
fication where a response was counted correct only if both
the color and number of the target sentence were identified.
Chance performance was about 0.03 (1/32; 4 colors × 8
numbers). The four listeners were very similar in their per-
formance, therefore the results are displayed as group means
and standard deviations. Figure 1 shows proportion correct
identification (symbols) as a function of the probability of
occurrence of the target at a specific location for the callsign
before (circles) and callsign after (triangles) conditions. For
both callsign before and callsign after, performance declined
as target location uncertainty increased. For the callsign after
condition, performance decreased from a proportion correct
of about 0.91 when the target location was certain (i.e., p
= 1) to about 0.31 when the target location was randomly
chosen among the three locations. For callsign before, pro-
portion correct identification performance was about the
same as for callsign after when the location was certain
(0.92) and decreased to about 0.67 when the location of the
target was chosen at random. The dashed line at the bottom
indicates chance performance and the other two lines with no
symbols (dotted and dash-dot) will be discussed later.
In order to determine whether the trends apparent in Fig.
1 were statistically significant, the data were transformed
into arcsine units and then submitted to a repeated-measures
ANOVA with three within-subjects factors: callsign, location
(−60°, 0°, +60°), and probability of occurrence at a given
location. All three main factors were significant: callsign
[F(1,3) = 59.3, p < 0.01], location [F(2,6) = 75.5, p < 0.001],
and probability [F(3,9) = 998.1, p < 0.001]. In addition, the
interaction of callsign and probability was significant
[F(3,9) = 95.5, p < 0.001] as were the interactions of location
and probability [F(6,18) = 9.95, p < 0.001] and callsign and
location [F(2,6) = 6.2, p < 0.05]. The three-way interaction
was not significant (p > 0.05).
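The paper says proportion-correct scores were "transformed into arcsine units" before the ANOVA but does not give the formula; the standard variance-stabilizing angular transform, arcsin(sqrt(p)), is assumed in this sketch.

```python
import math

def arcsine_transform(p):
    """Angular (arcsine) transform for a proportion p in [0, 1].

    Proportions near 0 or 1 have compressed variance; arcsin(sqrt(p))
    roughly stabilizes the variance so that parametric tests such as a
    repeated-measures ANOVA are better behaved. The exact form used in
    the paper is not stated, so this standard form is an assumption.
    """
    return math.asin(math.sqrt(p))
```

For example, the transform maps 0 to 0, 0.5 to pi/4, and 1 to pi/2, stretching the ends of the proportion scale relative to the middle.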
Knowing the callsign in advance potentially allowed the
listener to identify the target voice early in the stimulus and
either find and focus on the location of the target talker or
simply follow the voice of the target talker until the test
items occurred, or both. However, it appears that knowing
the callsign beforehand without knowledge about location
was less useful as a cue than the converse. The proportion
correct identification for the condition in which callsign was
certain (callsign before) paired with uncertain location (p
= 0.33, random) was on average about 0.67 whereas the cer-
tain location condition (p = 1) paired with uncertain callsign
(callsign after) was about 0.91. When p = 1 and the callsign
was given in advance, no additional improvement was noted
(0.92). The callsign-before random-location condition re-
sulted in roughly the same performance as the callsign-after
p = 0.8 condition.
Not only are the main effects of callsign before versus
after and of probability of occurrence apparent in Fig. 1, the
significant interaction between the two factors is also obvi-
ous. There was essentially no difference between callsign
after and callsign before at p = 1, but the difference between
the two increased systematically as uncertainty increased un-
til, in the random condition, the proportion correct in the
callsign before condition was about 0.36 higher than in the
callsign after condition. The main effect of location noted
above was significant because, overall, presenting the target
from the location of +60° (to the right) resulted in a higher
proportion correct (0.76) than for the locations of −60°
(0.65) or 0° (0.66). However, as the significant interaction
between location and probability of occurrence indicates, the
effect of location also depended on uncertainty and was
mainly due to the random condition. This effect may be seen
in Fig. 2, which displays proportion correct identification as
a function of the target location uncertainty. The circles are
FIG. 1. Group mean proportion correct identification scores, and standard
deviations of the means, plotted as a function of increasing uncertainty
about target location (a priori probability of occurrence). The circles repre-
sent performance in the “callsign before” condition and triangles indicate
performance in the “callsign after” condition (see text). The dashed line at
the bottom represents chance performance. The lines near the data points
indicate the performance predicted by a simple single-source listener strat-
egy for callsign before (dash-dot) and callsign after (dot) conditions.
FIG. 2. Group mean proportion correct identification scores, and standard deviations of the means, subdivided according to the location from which the target
was presented.
for the callsign before condition and the triangles are for the
callsign after condition. Each panel is for a different location.
In the p = 1 case, no additional data sorting was required.
However, for the p < 1 cases the data were sorted according
to actual location (that is, the proportion correct is for all
trials in which the target was presented from that particular
location as opposed to proportion correct for all trials in the
nominal location condition). The error bars are ±1 standard
deviation of the mean. The effect of location increases with
increasing location uncertainty with the greatest difference in
performance, as noted above, for the two random conditions,
where proportion correct identification was about 0.3 higher
for target at +60° than at −60°.
It is also of interest to determine how well listeners per-
formed when the target actually occurred at the expected
location versus when it occurred at an unexpected location.
This can be examined in the two intermediate probability
conditions. It might be expected, for example, that the listen-
ers always focused attention at the more likely location. In
that case, performance should equal that found in the p = 1
condition (i.e., about 0.92 correct) on the trials when the
target was presented from that location. Figure 3 displays
proportion correct performance as a function of expected and
unexpected target locations. Expected location means that
the target sentence was played at the most likely location,
while unexpected location means that the target sentence was
played at one of the other two less likely locations. Circles
are for the callsign before condition and triangles are for the
callsign after condition. Unfilled symbols indicate target pre-
sentation at the expected location and filled symbols are for
target presentation at unexpected locations. Data are means
and standard deviations computed across listeners for all lo-
cations. The results from p = 1 and p = 0.33 are included here
for comparison (same as in Fig. 1), although there is no “un-
expected” case for p = 1 or “expected” case for p = 0.33. The
horizontal lines will be discussed later.
When the target was presented at the expected location,
proportion correct identification performance was 0.8 or
greater in all conditions. In that case, the callsign condition
did not matter and the degree of uncertainty had a relatively
minor effect, decreasing from around 0.92 for p = 1 to around
0.80 for p = 0.6. The decline in observed performance with
decreasing p undoubtedly reflects a cost associated with the
greater uncertainty about location and could indicate some
attempt by the listeners to increasingly divide or distribute
attention among locations. However, the small effect sug-
gests that this had only a minor influence on performance for
trials at the expected location.
In contrast to the results obtained when the target was
presented at the expected location, target presentation from
an unexpected location led to much poorer performance with
large differences observed between callsign before and call-
sign after conditions. For callsign before, performance actu-
ally improved as uncertainty increased from a proportion cor-
rect of about 0.43 for p = 0.8 to about 0.67 for p = 0.33. The
improvement in performance with increasing target-location
uncertainty is a reasonable outcome if one assumes that there
is a substantial penalty associated with attending to the
wrong location.
Perhaps the most striking result shown in Fig. 3 is that,
for the callsign after condition and p = 0.8 or 0.6, listeners
were almost never correct (0.02 and 0.05, respectively) when
the target occurred at an unexpected location. Even the
knowledge that the target would occur in the more likely
location only 60% of the time still did not improve perfor-
mance for the unexpected locations. Therefore, for callsign
after, the listeners appeared to focus attention almost entirely
at the expected location.
For the callsign before p = 0.8 and 0.6 conditions, listen-
ers were correct nearly half the time (0.43 and 0.53, respec-
tively) when the location was unexpected. This implies that
listeners used a combination of expected location and target
callsign to perform the task. They probably were not using
target callsign alone because performance for expected loca-
tions was significantly better than for unexpected locations.
Obviously, they did not use location alone either because
they were correct for unexpected locations fairly often.
B. Predictions of a simple single-source listener
strategy
There are a number of strategies potentially available to
the listener in attempting to solve this task. It is not possible
from the data obtained in the current experiments, however,
to evaluate and decide among all of the alternative strategies
or observer models. On the other hand, it can be very useful
and informative to take a single, simple strategy and follow
through its assumptions and predictions. In this section, we
consider the predictions of one such strategy and make com-
parisons to the results described above.
There are three initial assumptions that define this lis-
tener strategy. First, it is assumed that the sources are per-
fectly segregated and that errors occur because attention is
directed to the incorrect source. It is already known, how-
ever, from the results discussed above, that performance was
not perfect even for the p= 1 case. Accordingly, the predic-
tions of listener performance that follow are scaled by a mul-
tiplier of 0.92, the observed proportion correct in the certain
FIG. 3. Group mean proportion correct identification scores (ordinate), and
standard deviations of the means, subdivided according to whether the target
was presented at the expected (more likely; open symbols) or an unexpected
(less likely; filled symbols) location. The abscissa is target location uncer-
tainty. The circles represent performance for the callsign before condition
while the triangles represent performance for the callsign after condition.
The horizontal lines not connecting data points are the predictions of the
single-source listener strategy discussed in the text.
location case, to reflect some unspecified limitations on the
ability to identify targets at attended locations. Second, it is
assumed that the listener attends to only one source at any
given moment in time. While this is a strong assumption that
excludes models based on divided attention, it also is
straightforward to evaluate and provides a means for deter-
mining whether interpretations based on divided attention are
necessary. The results described above for targets occurring
at unexpected locations for the callsign after conditions pro-
vide some degree of support for this assumption. Third, it is
assumed that the listener always attends to the more likely
location. In the case of callsign before presentation, this pro-
vides an opportunity for the listener to switch attention to
another location if it is determined that that location does not
contain the target. This would happen whenever a nontarget
callsign is presented from the location initially attended, i.e.,
the expected location. In those instances, it is assumed that
the listener randomly chooses to focus on one of the other
two sources because at the time switching occurs the other
callsigns would have already been presented and are as-
sumed not to be useful to the listener.
In determining the performance of this hypothetical lis-
tener, certain conditional probabilities may be defined. First,
for callsign after, performance should simply be the probabil-
ity of occurrence scaled by the p = 1 value of 0.92. So, the
predicted performance for callsign after is
PCa=pPCmax,1
where PCais the predicted proportion correct for the call-
sign after condition, pis the a priori probability of occur-
rence at one location, and PCmax is the highest proportion
correct possible for an attended location based on the p
=1 results. The predictions of listener performance from
this equation are shown as the dotted line in Fig. 1.
The predictions for the callsign before condition include
PCaas a term, but also include a term representing the in-
crease in performance expected by switching attention after
determining that the attended location is not the target loca-
tion. In that case

PCb = PCa + 0.5[(1 − p)PCmax],  (2)

where PCb is the predicted performance for the callsign
before condition. The predictions for performance based
on this equation are also shown in Fig. 1 as a dash-dotted
line. As a first approximation, this simple strategy accounts
for the group-mean accuracy results fairly well (compare the
group-mean data and lines in Fig. 1).
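The arithmetic of Eqs. (1) and (2) can be checked with a minimal Python sketch. This assumes PCmax = 0.92 as estimated from the p = 1 conditions; the function names are illustrative, not from the paper.

```python
# Single-source listener strategy, Eqs. (1) and (2) in the text.
# PC_MAX (0.92) is the p = 1 identification score reported in the paper;
# the function names are our own, used only for illustration.

PC_MAX = 0.92

def pc_callsign_after(p):
    """Eq. (1): attend to the most likely location; no switching."""
    return p * PC_MAX

def pc_callsign_before(p):
    """Eq. (2): after hearing a masker callsign at the attended location,
    switch attention and choose between the two remaining sources
    at random (probability 0.5 of selecting the target)."""
    return pc_callsign_after(p) + 0.5 * (1 - p) * PC_MAX

for p in (1.0, 0.8, 0.6, 1 / 3):  # 1/3 approximates the random condition
    print(f"p = {p:.2f}: "
          f"after = {pc_callsign_after(p):.2f}, "
          f"before = {pc_callsign_before(p):.2f}")
```

For p = 1/3 (the random condition) the callsign after prediction is about 0.31, matching the value quoted below.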
It is also possible to compare the predictions associated
with this strategy to the results shown in Fig. 3, where the
listener responses are computed according to target presen-
tation at expected and unexpected locations for the target
probabilities of 0.8 and 0.6. The predictions of this listener
strategy are straightforward. Based on the assumption that
the listener attends to the most likely location, performance
should equal PCmax whenever the target is actually presented
from that location for both the callsign before and callsign
after conditions. This prediction is shown in Fig. 3 as horizontal
lines (dash-dotted and dotted for the two callsign conditions),
both at a proportion correct value of 0.92. As noted
earlier, performance is slightly below the prediction, possibly
reflecting a cost associated with target location uncertainty.
When the target occurs at unexpected locations, the predic-
tions of this listener strategy differ markedly for callsign af-
ter and callsign before conditions. For callsign after, the only
information available to the listener is from the masker at the
expected location, so optimal performance would be to
choose a color/number combination from among the remain-
ing alternatives after eliminating the ones known to be inac-
curate. Thus, performance should be at around 0.05 proportion
correct (3 colors × 7 numbers = 21 remaining alternatives), with
a small correction for guessing on the (1 − PCmax) proportion of
the trials (shown by the horizontal dotted line). In fact, the observed performance
is close to that prediction. In the random condition, we as-
sume the listener simply attends to one arbitrarily chosen
location and thus should obtain a proportion correct of p
times PCmax, or about 0.31, which, as noted above in Fig. 1, is
quite close to the data point.
For the callsign before conditions, once the masker call-
sign is heard from the expected location, the hypothetical
listener switches the focus of attention to one of the other
locations. The choice would be arbitrary because it is as-
sumed that only the callsign from the attended—and
incorrect—location was processed, so the listener would
have a 0.5 probability of selecting the correct source. Thus,
performance should be equal to PCmax times 0.5, or about
0.46 proportion correct (shown as the dash-dotted line at that
value). Inspection of Fig. 3 suggests that, again, this listener
strategy predicts performance reasonably well with the ob-
tained performance for callsign before at unexpected loca-
tions near the predicted values.
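As a rough check, the unexpected-location predictions just derived can be reproduced in a few lines. This is a sketch under the text's assumptions (PCmax = 0.92; variable names are illustrative).

```python
# Predictions of the single-source strategy when the target occurs at an
# UNEXPECTED location (assumes PCmax = 0.92, as in the text).

PC_MAX = 0.92

# Callsign after: the listener only knows the masker at the attended
# location, so the best move is to guess among the remaining
# 3 colors x 7 numbers = 21 color/number alternatives.
pc_after_unexpected = 1 / (3 * 7)

# Callsign before: attention is switched after the wrong callsign is
# heard; the choice between the two remaining sources is arbitrary.
pc_before_unexpected = 0.5 * PC_MAX

print(f"callsign after:  {pc_after_unexpected:.2f}")   # ~0.05
print(f"callsign before: {pc_before_unexpected:.2f}")  # 0.46
```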
Overall, this simple listener strategy was fairly success-
ful at predicting the proportion correct obtained from actual
listeners in most conditions. However, there were some ef-
fects that it cannot capture, such as the difference between
locations revealed in Fig. 2. Also, the results in Fig. 3 indi-
cate small but understandable differences between the 0.8
and 0.6 probability conditions that are not accounted for in
this simplified strategy. The results from the 0.6 condition
suggest a slightly greater tendency to distribute attention
with lower scores at the expected locations and higher scores
at unexpected locations.
C. Error analysis
It is also informative to examine the errors in identifica-
tion made by listeners as uncertainty varied. This analysis is
of interest both for attempting to understand actual listener
performance and for evaluating the listener strategy consid-
ered above which makes strong predictions about the types
of errors that should occur. We should consider four broad
categories of errors that are possible. First, the listener could
confuse the target and a masker so that the color and number
that are reported correspond to those of one of the two
maskers. That type of error could be considered misdirection
of attention due to focusing on the wrong source. This would
form the great majority of errors predicted by the listener
strategy discussed above. A second type of error would be
due to random guessing from among the alternatives or per-
haps guessing after eliminating one color and one number as
hypothesized above for some callsign after conditions. Errors
from random guessing might occur if performance were lim-
ited by energetic masking, where the test words (colors and
numbers) were obscured and guessing was the only option
available. Third, the errors could take the form of mixing
colors and numbers from among the three sources, either one
word from the target and one from a masker or one each
from the two different maskers. This type of error might
occur due to a breakdown in stream segregation. That is, the
words are available but the listener is unable to properly
connect the callsign, color, and number. And, finally, the er-
rors could contain words not presented during the trial. This
type of error could result in a mixture of words including one
target or masker word and one word not presented on the
trial (i.e., guessed from among the other alternatives). This
type of error is not as easily interpreted as the other three
types because there is a fairly high chance that it could occur
due to random guessing (9 of the 32 alternatives are presented
on each trial). The frequency of occurrence of each of
these error types can be determined by analyzing the errors
found in the experiment.
Figure 4 shows the proportions of incorrect responses
obtained in the experiment subdivided according to the three
main types of errors that were made. These data are plotted
as a function of location uncertainty, and, because the pat-
terns of errors were very similar across listeners, are shown
as group means. The different types of errors are indicated by
the black, gray, and white portions of the stacked bars. First,
the overall height of the bars represents the proportion of all
errors in which both of the responses (color and number)
matched any of the six key words spoken on a given trial.
For p=1, this accounted for a proportion of the errors equal
to 0.87, while for the other values of p this accounted for
proportions between 0.93 and 0.97, regardless of callsign
condition. This means that very little guessing occurred in
any of these conditions—the key words were confused or
mixed together but appear to have
been available to the listener. By way of comparison, the
expected proportion of errors due to random guessing in
which both responses were from the words spoken on a
given trial is 0.26 (8 of the 9 possible combinations of words
presented on a trial are errors, out of 31 total possible error
combinations). The obtained proportion of errors of this type
overall was 0.94 (the average of the total heights of the bars
in Fig. 4). These findings, along with the p = 1 results, sug-
gest that very little energetic masking was present in any of
these conditions and support the conclusion that the second
type of error discussed above (random guessing) had a neg-
ligible effect on performance.
The different shadings indicate a finer-grain analysis of
these errors. The black (lower) portions of the bars represent
the proportion of errors in which one of the words reported
was from the target sentence and the other word was from
one of the two masker sentences. This type of error was by
far the most common for p = 1 (keeping in mind that there
were very few errors overall in this condition) but was less
frequent for the other values of p. The occurrence of this
error would be consistent with a breakdown in the process of
speech stream segregation: the target and masker sentences
were not held separate, but were mixed. To the extent that
stream segregation involves perceptually connecting a se-
quence of sound elements that belong together, these “mix-
ing” errors reveal a failure of that process and dominated the
p = 1 condition (0.75 and 0.72 proportion of errors for call-
sign before and after, respectively). They also accounted for
substantial proportions of errors (ranging from about 0.18 to
0.38) in the more uncertain location conditions. For these
p < 1 conditions, the proportion of errors of this type was
somewhat higher for callsign before than for callsign after.
The intermediate gray bars indicate the proportions of
errors in which both key words reported corresponded to the
key words from one of the two masker sentences. This was
the first category of error discussed above. When the listener
was certain about where to listen (p = 1), confusions with
masker talkers rarely occurred (0.10 to 0.14 proportion of
errors). For the conditions containing location uncertainty,
however, masker confusions formed the most common type
of error with overall proportions of errors ranging from 0.50
to 0.69, with higher proportions for callsign after than for
callsign before. This could occur if the spatial focus of at-
tention was directed to the wrong location and the listener
reported the color and number from that location. Finally, the
white bars at the top indicate the proportions of errors when
one word was from one masker and the other word was from
the other masker. That type of error was infrequent in all
conditions, did not differ with callsign knowledge, and was
least frequent for the p= 1 case.
The remaining errors not plotted in Fig. 4 are cases in
which at least one word reported did not correspond to any
of the words spoken (one target word and one unspoken
word; one masker word and one unspoken word; both words
unspoken). The probability from random guessing for one
word spoken and one word unspoken is 0.58 (18 of 31),
whereas the obtained proportion of errors of that type was
0.055. The expected proportion of errors due to guessing
when both words reported were unspoken would be 0.16 (5
of 31), but the obtained proportion was only 0.005.
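The chance proportions quoted in this section (0.26, 0.58, and 0.16) follow from simple counting. A sketch, assuming the CRM response set of 4 colors × 8 numbers and 3 talkers per trial (so 3 colors and 3 numbers are spoken on every trial):

```python
# Chance proportions of each error category under random guessing.
# Assumes the CRM response set (4 colors x 8 numbers = 32 combinations,
# 31 of which are errors) and 3 talkers per trial, so 3 colors and
# 3 numbers are spoken on every trial.

N_COLORS, N_NUMBERS = 4, 8
SPOKEN_COLORS, SPOKEN_NUMBERS = 3, 3

total_errors = N_COLORS * N_NUMBERS - 1  # 31 incorrect combinations

# Both reported words spoken on the trial (excluding the target itself).
both_spoken = SPOKEN_COLORS * SPOKEN_NUMBERS - 1  # 8

# Exactly one reported word spoken, the other unspoken.
one_spoken = (SPOKEN_COLORS * (N_NUMBERS - SPOKEN_NUMBERS)
              + (N_COLORS - SPOKEN_COLORS) * SPOKEN_NUMBERS)  # 18

# Neither reported word spoken on the trial.
none_spoken = (N_COLORS - SPOKEN_COLORS) * (N_NUMBERS - SPOKEN_NUMBERS)  # 5

# The three categories partition the 31 possible errors.
assert both_spoken + one_spoken + none_spoken == total_errors

print(round(both_spoken / total_errors, 2))  # 0.26
print(round(one_spoken / total_errors, 2))   # 0.58
print(round(none_spoken / total_errors, 2))  # 0.16
```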
FIG. 4. This bar graph indicates the proportions of various types of errors in
the speech identification task. The ordinate is the proportion of errors and
the abscissa is the degree of target location uncertainty. The different
shadings designate the types of errors (see legend and text).
As discussed above, the most common type of error for
the p < 1 conditions was a masker confusion error. To what
extent does this error reflect focusing attention at the wrong
location? An analysis of errors that is relevant to this issue
involves determining the extent to which incorrect responses
in uncertain location conditions corresponded to the masker
sentence that was presented at the expected target location.
That is, the listener knew that the target was more likely at
one particular location and incorrectly reported the masker
color and number from that location. This analysis is only
possible for the 0.8 and 0.6 probability conditions. Figure 5
shows the proportion of incorrect responses that matched the
color and number from the masker sentence presented at the
expected location when the target was actually presented at
an unexpected location. The data are intersubject means and
standard deviations combined over the three locations. The
black bars are for callsign before and the white bars are for
the callsign after condition. In the callsign before condition,
the proportion of errors corresponding to the masker sen-
tence presented at the expected target location was between
0.4 and 0.5. For the callsign after condition, the correspond-
ing proportions of errors were much higher (0.74 and 0.81).
These data, along with those presented in Fig. 4, are
inconsistent with the simple listener strategy described in
Sec. III B above. Although that strategy was generally suc-
cessful in accounting for the group-mean accuracy results in
the various conditions, the patterns of errors obtained were
not consistent with that strategy. The listener strategy exam-
ined above predicts that all of the errors (except perhaps
those equal to 1 − PCmax) should be confusions with one
of the two maskers. That prediction was not supported by the
results. The proportions of errors for p < 1 cases that were
not from one specific masker ranged from about 0.3–0.5 and
increased to almost 0.9 when p= 1. Thus, a substantial pro-
portion of errors occurred that could not be attributed to con-
fusing a masker with the target or simply reporting the sen-
tence from the wrong location. Furthermore, the callsign
before results shown in Fig. 5 are incompatible with the pre-
dictions of a switching strategy because nearly one-half of
the errors corresponded to reporting a masker presented at
the expected location. While these proportions are smaller
than those seen in the callsign after condition in which this
strategy is not possible, the switching strategy predicts that
none of the errors should correspond to that location. This
suggests a fairly strong tendency to report the content from
the expected location and not shift attention away from that
location during stimulus presentation. For callsign before,
that result is surprising because the listener should realize
that the expected location does not contain the target as soon
as the inaccurate callsign is heard. We do not have a good
explanation for this nonoptimal listener behavior and can
only speculate that the results reflect a strong bias in favor of
a priori information regarding location. The tendency to rely
on expected target location is even more strongly apparent
for the callsign after condition in which the proportions of
errors from the expected location were greater than 0.74. In
that case, if it were assumed that only the callsign, color, and
number from one location—the attended location—were re-
membered, then a more effective strategy would be simply to
guess from among the color and number alternatives after
excluding those from that masker. In that case, again, none of
the errors would correspond to the most likely target loca-
tion. The results in Fig. 5 reveal that listeners tended not to
adopt that strategy.
IV. DISCUSSION
The first point to be made concerns the degree to which
the findings here answer the question posed in the Introduc-
tion regarding the role of spatial focus of attention in solving
the cocktail party problem. The results of the current study
support the conclusion that focusing attention toward a target
sound source in the presence of spatially distributed maskers
can provide a very significant advantage in speech identifi-
cation performance. However, this prominent role of spatial
focus of attention may depend on other factors such as the
complexity of the listening environment and the processing
load placed on the observer.
In order to conclude that the listening advantages found
in this study may be attributed to spatial focus of attention,
we must consider the possible role of other factors such as
masking and perceptual segregation of sounds. In contrast to
studies in which the spatial separation between sources was
varied in an attempt to reveal listening advantages due to
binaural interaction and better ear listening, the spatial loca-
tion of the sources in the present study did not vary across
conditions. Thus, the binaural cues of interaural time and
level differences allowed the listener to locate the sources
within the environment, but did not differ across test condi-
tions and whatever amount of masking was present did not
influence the main variables of interest. The only case where
acoustic differences may have played a factor is in the com-
parison of performance across fixed locations. For example,
an acoustically “better ear” would occur when the target was
presented through either the left or right loudspeaker com-
pared to the center loudspeaker. As shown in Fig. 2, some
differences in performance were observed favoring the right
location in the random condition. However, this is not likely
to be an indication of binaural masking release because it
was not present consistently across conditions nor was it
FIG. 5. The proportions of incorrect responses that matched the color and
number of the masker sentence presented at the expected (more likely)
location when the target was presented at an unexpected (less likely) loca-
tion. The values plotted are group means and standard deviations for the
callsign before (black bars) and callsign after (white bars) conditions.
present for left location presentation, which would be ex-
pected to produce roughly the same binaural cues as the
symmetrically placed right location.
It is not clear from Fig. 2 whether the difference in per-
formance across loudspeaker locations reflects a genuine per-
ceptual effect or a bias on the part of the listeners to attend to
the location on the right.2 There is a long history of study of
“right-ear advantages” in dichotic listening and, although the
effect has been attributed primarily to differences in the pro-
cessing of speech in the two hemispheres of the brain (e.g.,
Kimura, 1967; Zatorre, 1989), there is also evidence that the
effect is labile and susceptible to attentional focus (e.g., As-
bjornsen and Hugdahl, 1995; Asbjornsen and Bryden, 1996).
A further analysis of the data was undertaken to help under-
stand this effect in the current experiment. The results of this
analysis are shown in Fig. 6. This figure shows the propor-
tions of all words reported from the more likely location
regardless of whether the words were from targets (lower
black portions of bars) or from maskers (upper white por-
tions of bars). Note that for the random condition, there was
no “more likely location” so the bars represent all responses
to words from each location.
Inspection of Fig. 6 indicates that a strong asymmetry
between right- and left-side locations is only clearly evident
in the random condition. In that condition, not only was there
a tendency to report target words more often when they were
presented from the right side, but there was also a tendency
to report masker words more often when presented from the
right side (primarily for the callsign after condition), but in
both callsign conditions both target and masker words were
reported less often when located to the left. Thus, the words
presented from the right loudspeaker were chosen more often
than those from (in particular) the left loudspeaker regardless
of whether they were correct or incorrect. This response pat-
tern clearly reflects bias on the part of the listeners because
the expected probability of occurrence from the three loca-
tions in the random condition was equal. This bias largely
disappears when uncertainty about target location is de-
creased. Whether there is also a genuine processing differ-
ence, and whether that is in some way related to the observed
bias, cannot be ascertained from these data. Another relevant
point is that, because the stimuli were presented via loud-
speakers, the speech from all three talkers was present in
both ears on every trial. Thus, the differences in performance
must be related to the differences in acoustics as target posi-
tion varied. The acoustic differences between target and
masker talkers, though, are much less than in dichotic listen-
ing tasks where the stimuli are presented separately to the
two ears. It is not clear how the effect found here is related to
these acoustic differences. It should be noted that both Bolia
et al. (2001) and Brungart and Simpson (2005) have reported
right hemifield identification performance advantages using
the CRM test for earphone-based stimuli processed by
HRTFs.
The analysis presented in Fig. 6 also provides insight
into another issue. There is an extensive literature concerning
the tendency of subjects to adjust their response strategy to
match the a priori probabilities of different response alterna-
tives being correct, or having different payoffs, even if that
strategy is nonoptimal (called “probability matching”; e.g.,
Shanks et al., 2002; West and Stanovich, 2003; see also the
review by Vulkan, 2000). In the current experiments, evidence for
probability matching might be found by comparing the pro-
portions of responses to stimuli from expected locations to
the a priori probabilities of target presentation from those
locations. For callsign before, there is indeed a reasonably
close correspondence between overall response rates and the
assigned probabilities (Fig. 6). However, because the callsign
was provided in addition to location probabilities, it cannot
be determined if this correspondence truly reflects matching
responses to probabilities of occurrence at the different loca-
tions or is a by-product of a combined callsign-location
weighting strategy. Furthermore, the opportunity for switch-
ing the attended location based on callsign also could pro-
duce the proportions of responses shown in Fig. 6. For call-
sign after for p= 0.8 and, especially, p=0.6, little support for
a probability matching interpretation is apparent. The ob-
tained proportions of responses from the expected locations
for those two conditions are greater than 0.9 and 0.8, respec-
tively, which are substantially above the corresponding prob-
abilities of occurrence.
The finding that identification performance was so accu-
rate when p= 1 provides support for the conclusion that, gen-
erally, masking was minimal and the sound sources could be
FIG. 6. The proportion of all words (single words or pairs) reported from the more likely location. This includes both words from the target (correct
responses—lower black portion of bars) and from a masker (incorrect responses—upper white portion of bars). In each panel both callsign before and after
responses are shown at each loudspeaker location. In the random condition (right-most panel) these are just proportions of responses by location because there
was no “more likely” location.
successfully segregated. This conclusion is based on the as-
sumption that both factors are necessary for successful iden-
tification to occur. However, errors attributed to a breakdown
in the perceptual segregation of the speech streams were
present in all conditions. Although these “mixing errors”
were the most common type of error for the certain-location
conditions, the errors were still very infrequent with propor-
tion correct scores over 0.9 found for both callsign condi-
tions. The success of the listeners in those conditions indi-
cates that the speech streams can be segregated, but the
maintenance of separate streams is vulnerable to increasing
uncertainty about target location. Here, when attention is fo-
cused at the wrong location, performance suffers with in-
creased uncertainty and errors consistent both with misdi-
rected attention and loss of stream segregation were found.
Furthermore, because identification performance was not
perfect even when the most a priori information was pro-
vided to the listener (about 0.92 proportion correct for the
callsign before, p = 1 condition), some interference caused by
nontarget sound sources clearly was present (proportion cor-
rect identification performance for a single-talker baseline
condition was nearly perfect). Thus, the two more common
types of errors were confusions between target and masker,
which we believe indicate misdirection of attention, and mix-
ing target and masker words, which probably indicates a
breakdown in maintaining speech stream segregation.
The present results also imply that surprisingly little
shifting or redirecting of attention occurred during trials.
When the callsign was provided before trials, but the target
occurred at an unexpected location (p = 0.6 and 0.8), nearly
one-half of the errors corresponded to the masker at the tar-
get’s most likely location. These conditions indicate a rivalry
of cues: target talker versus location. In this case, directing
attention to the wrong location clearly diminished the ability
of the listener to follow the speech of the target talker from
the callsign to the test words. These scores are worse, in fact,
than in the random case where proportion correct perfor-
mance was about 0.7. Thus, when the target talker was indi-
cated beforehand there was a penalty of directing attention to
the wrong location.
Comparison of performance in the callsign before and
callsign after conditions overall indicates the interaction be-
tween the processing load imposed by the task and the im-
portance of spatial focus of attention. Ideally, if an observer
were capable of remembering all three simultaneous sen-
tences, specifically, associating and remembering the three
key words from each source—the callsign, color, and num-
ber, then spatial information would be unnecessary and per-
formance in both callsign before and callsign after conditions
would be equivalent. Clearly, this was not possible when
there were three simultaneous talkers. According to the “load
theory of attention” proposed by Lavie et al. (2004), when
observers are faced with a very demanding perceptual task—
segregating one element of an array of similar elements, for
example—the nontarget elements produce little interference
in the selection and processing of the target. This is because
of the assumption that all available perceptual resources are
occupied at any given point in time. If the resources required
to process the target occupy the entire pool of resources
available, none are left to allocate to distracting sources, so
little or no interference occurs. However, if the perceptual
task is not demanding but the subsequent cognitive control
load is, as would be the case if a high load were placed on
working memory, much greater interference from distracters
is predicted. In the current conditions, despite the fact that
the three sources are male talkers, the segregation task is
relatively easy. This conclusion is based on the high identi-
fication scores in the certain-location conditions. However,
particularly in the callsign after case, the cognitive load—the
demands on working memory that would be sufficient to
solve the task—are very high. Furthermore, when uncer-
tainty about location increased, errors in perceptual
segregation—mixing errors—increased as well note that
here we are referring to the combination of number of errors
as in Fig. 1 and the error type as in Fig. 4. If we assume that
increasing uncertainty mainly affects cognitive load, then in-
creasing errors from distracters—either misdirected attention
or loss of stream segregation—would be expected. Thus, al-
though the evidence in support of the load theory usually
takes the form of response times (as do most data concerned
with auditory attention, with some exceptions; cf. Sach et al.,
2000; Erickson et al., 2004), our results appear to be quali-
tatively consistent with that theory.
We do not have data for other numbers of talkers (except
for the control case of one talker) but speculate that equally
strong spatial effects could be observed for four or more
talkers. This is because the three-talker condition has already
degraded performance to near the limiting case imposed by
the number of potential target sources. Assuming that each
source may be segregated from the others and that only one
message can be remembered, then performance in the call-
sign after-random condition should at best be equal to the
reciprocal of the number of sources, if each is equally likely
to be the target. That is, the listener chooses one to attend to
and does so perfectly and completely to the exclusion of the
others. For callsign after, performance in the random condi-
tion was near that which would be expected based on this
strategy. An interesting case, then, is for two simultaneous
sources. At present, we have some indications that perfor-
mance in the two-talker callsign after condition may exceed
the reciprocal of the number-of-sources limit (Gallun et al.,
2005).
Recently, the two- vs three-source comparison for mul-
titalker speech identification has been considered by Brun-
gart and Simpson (2002). In their study, the task was the
same as here—reporting the color and number for a desig-
nated callsign in the CRM task—and there were either one or
two other talkers also uttering sentences from the CRM task.
However, the stimuli were presented via earphones. When
the target, which was presented to the “ipsilateral” ear, was
mixed with one masker in the same ear, performance varied
predictably according to the target-to-masker ratio (T/M). When one masker was pre-
sented to the contralateral ear, no interference in target iden-
tification was observed; i.e., identification performance ap-
proached 100% correct. When both maskers—one ipsilateral
and one contralateral—were presented, target identification
performance was as much as 40 percentage points poorer
than the single (ipsilateral) masker condition. Brungart and
Simpson (2002) attributed this large decline in performance
to the much greater difficulty in ignoring two sources, rather
than one source, when there was a difficult segregation task
to perform in the target ear. Kidd et al. (2003) found a par-
allel effect using complex nonspeech stimuli. The greater
difficulty in three-talker conditions than in two-talker condi-
tions has been noted by Yost et al. (1996) and Hawley et al.
(2004). In the current study, we speculate that the high pro-
cessing load caused by three simultaneous talkers contrib-
uted to the large beneficial effect of a priori information,
especially that of the most likely target location.
V. SUMMARY AND CONCLUSIONS
A priori knowledge about where to direct attention pro-
vided significant advantages in speech identification in
highly uncertain multitalker listening conditions. This result
was found for both cued and uncued target sentence presen-
tation. The differences in performance found across the vari-
ous conditions could not be attributed to binaural analysis,
masking, or perceptual segregation, but rather appear to be
uniquely related to focus of attention. The pattern of errors
suggested a high degree of spatial selectivity with very little
processing of speech originating from unattended locations.
A simple single-source listener strategy was found to predict
accuracy fairly well in most conditions but failed to account
for the observed patterns of errors. The current results sup-
port the view that spatial focus of attention can be a very
important factor in complex and uncertain multisource listen-
ing environments and may play a crucial role in solving the
“cocktail party problem.”
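The single-source listener strategy mentioned above can be sketched as a small expected-value calculation. This is a minimal illustrative sketch, not the paper's fitted model: the function name and the example probabilities (identification succeeds with probability 0.9 when the attended location holds the target, and at chance, here set to zero, otherwise) are our assumptions for illustration.

```python
def single_source_accuracy(n_locations, p_correct_attended, p_chance,
                           p_attend_target=None):
    """Expected proportion correct for a listener who focuses on one location.

    If the attended location holds the target, identification succeeds with
    probability p_correct_attended; otherwise the listener scores at p_chance.
    With no location cue, the listener is assumed to pick a location uniformly
    at random (illustrative assumption, not the paper's fitted parameters).
    """
    if p_attend_target is None:
        p_attend_target = 1.0 / n_locations  # uniform guess when location is unknown
    return p_attend_target * p_correct_attended + (1 - p_attend_target) * p_chance

# Location unknown: roughly a 1-in-3 chance of attending the target talker.
unknown = single_source_accuracy(3, 0.9, 0.0)
# Location certain: attention always lands on the target.
known = single_source_accuracy(3, 0.9, 0.0, p_attend_target=1.0)
print(round(unknown, 2), round(known, 2))  # prints: 0.3 0.9
```

Under these assumed values the strategy predicts performance near one-third of the cued ceiling when location is uncertain, consistent with the qualitative pattern described above, though it does not by itself account for the observed error patterns.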
ACKNOWLEDGMENTS
The authors are grateful to Kelly Egan for her assis-
tance. They would also like to thank Nathaniel I. Durlach
and Barbara Shinn-Cunningham for comments on an earlier
version of the manuscript. This work was supported by
Grant Nos. DC00100, DC04545, and DC04663 from NIH/
NIDCD and by AFOSR Award No. FA9550-05-1-2005. Fre-
derick Gallun was supported by F32 DC006526 from
NIDCD.
1. The use of the term “spatial dimension” in this article is confined to
variations in sound source azimuth. A complete description of sound source
location includes distance from the listener and elevation, which are not
considered here.
2. Preferential selection of objects or events on the right side has been
observed in many tasks other than speech recognition (e.g., Nisbett and
Wilson, 1977).
Arbogast, T. L., and Kidd, G., Jr. (2000). “Evidence for spatial tuning in
informational masking using the probe-signal method,” J. Acoust. Soc.
Am. 108, 1803–1810.
Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2002). “The effect of
spatial separation on informational and energetic masking of speech,” J.
Acoust. Soc. Am. 112, 2086–2098.
Asbjornsen, A. E., and Bryden, M. P. (1996). “Biased attention and the
fused dichotic words test,” Neuropsychologia 34, 407–411.
Asbjornsen, A., and Hugdahl, K. (1995). “Attentional effects in dichotic
listening,” Brain Lang. 49, 189–201.
Bolia, R. S., Nelson, T. W., and Morley, R. M. (2001). “Asymmetric per-
formance in the cocktail party effect: Implications for the design of spatial
audio displays,” Hum. Factors 43, 208–216.
Bolia, R. S., Nelson, W. T., Ericson, M. A., and Simpson, B. D. (2000). “A
speech corpus for multitalker communications research,” J. Acoust. Soc.
Am. 107, 1065–1066.
Bronkhorst, A. W. (2000). “The cocktail party phenomenon: A review of
research on speech intelligibility in multiple-talker conditions,” Acust.
Acta Acust. 86, 117–128.
Bronkhorst, A. W., and Plomp, R. (1988). “The effect of head-induced in-
teraural time and level differences on speech intelligibility in noise,” J.
Acoust. Soc. Am. 83, 1508–1516.
Brungart, D. S., and Simpson, B. D. (2002). “Within-ear and across-ear
interference in a cocktail-party listening task,” J. Acoust. Soc. Am. 112,
2985–2995.
Brungart, D. S., and Simpson, B. D. (2005). “Cocktail party listening in a
dynamic multitalker environment,” unpublished.
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001).
“Informational and energetic masking effects in the perception of multiple
simultaneous talkers,” J. Acoust. Soc. Am. 110, 2527–2538.
Cherry, E. C. (1953). “Some experiments on the recognition of speech, with
one and two ears,” J. Acoust. Soc. Am. 25, 975–979.
Colburn, H. S. (1996). “Computational models of binaural processing,” in
Auditory Computation, edited by H. Hawkins, T. McMullin, A. N. Popper,
and R. R. Fay (Springer-Verlag, New York), pp. 332–400.
Colburn, H. S., and Durlach, N. I. (1978). “Models of binaural interaction,”
in Handbook of Perception, Vol. IV, Hearing, edited by E. C. Carterette
and M. P. Friedman (Academic, New York).
Culling, J. F., Hawley, M. L., and Litovsky, R. Y. (2004). “The role of
head-induced interaural time and level differences in the speech reception
threshold for multiple interfering sound sources,” J. Acoust. Soc. Am.
116, 1057.
Durlach, N. I. (1972). “Binaural signal detection: Equalization and cancel-
lation theory,” in Foundations of Modern Auditory Theory, Vol. II, edited
by J. V. Tobias (Academic, New York).
Ebata, M. (2003). “Spatial unmasking and attention related to the cocktail
party problem,” Acoust. Sci. Tech. 24, 208–219.
Ericson, M. A., Brungart, D. S., and Simpson, B. D. (2004). “Factors that
influence intelligibility in multitalker speech displays,” J. Aviation Psych.
14, 311–332.
Gallun, F. J., Mason, C. R., and Kidd, G., Jr. (2005). “Task-dependent costs
in processing two simultaneous auditory stimuli,” unpublished.
Green, T. J., and McKeown, J. D. (2001). “Capture of attention in selective
frequency listening,” J. Exp. Psychol. Hum. Percept. Perform. 27, 1197–
1210.
Greenberg, G. S., and Larkin, W. D. (1968). “Frequency-response charac-
teristic of auditory observers detecting signals of a single frequency in
noise: The probe-signal method,” J. Acoust. Soc. Am. 44, 1513–1523.
Hawley, M. L., Litovsky, R. Y., and Culling, J. F. (2004). “The benefit of
binaural hearing in a cocktail party: Effect of location and type of inter-
ferer,” J. Acoust. Soc. Am. 115, 833–843.
Hill, N. I., Bailey, P. J., and Hodgson, P. (1998). “A probe-signal study of
auditory discrimination of complex tones,” J. Acoust. Soc. Am. 102,
2291–2296.
Kidd, G., Jr., Mason, C. R., Brughera, A., and Hartmann, W. M. (2005).
“The role of reverberation in release from masking due to spatial separa-
tion of sources for speech identification,” Acust. Acta Acust. 91, 526–536.
Kidd, G., Jr., Mason, C. R., Arbogast, T. L., Brungart, D., and Simpson, B.
(2003). “Informational masking caused by contralateral stimulation,” J.
Acoust. Soc. Am. 113, 1594–1603.
Kimura, D. (1967). “Functional asymmetry of the brain in dichotic listen-
ing,” Cortex 22, 163–178.
Lavie, N., Hirst, A., de Fockert, J. W., and Viding, E. (2004). “Load theory
of selective attention and cognitive control,” J. Exp. Psychol. Gen. 133,
339–354.
Macmillan, N. A., and Schwartz, M. (1975). “A probe-signal investigation
of uncertain-frequency detection,” J. Acoust. Soc. Am. 58, 1051–1058.
Mondor, T. A., and Zatorre, R. J. (1995). “Shifting and focusing auditory
spatial attention,” J. Exp. Psychol. Hum. Percept. Perform. 21, 397–409.
Mondor, T. A., Zatorre, R. J., and Terrio, N. A. (1998). “Constraints on the
selection of auditory information,” J. Exp. Psychol. Hum. Percept. Per-
form. 24, 66–79.
Nisbett, R. E., and Wilson, T. D. (1977). “Telling more than we know:
Verbal reports on mental processes,” Psychol. Rev. 84, 231–259.
Plomp, R. (1976). “Binaural and monaural speech intelligibility of con-
nected discourse in reverberation as a function of azimuth of a single
competing sound source (speech or noise),” Acustica 34, 200–211.
Sach, A. J., Hill, N. I., and Bailey, P. J. (2000). “Auditory spatial attention
using interaural time differences,” J. Exp. Psychol. Hum. Percept. Per-
form. 26, 717–729.
Scharf, B. (1998). “Auditory attention: The psychoacoustical approach,” in
Attention, edited by H. Pashler (Psychology Press, East Sussex, UK), pp.
75–117.
Scharf, B., Quigley, S., Aoki, C., Peachey, N., and Reeves, A. (1987). “Fo-
cused auditory attention and frequency selectivity,” Percept. Psychophys.
42, 215–223.
Schlauch, R. S., and Hafter, E. R. (1991). “Listening bandwidths and fre-
quency uncertainty in pure-tone signal detection,” J. Acoust. Soc. Am. 90,
1332–1339.
Shanks, D. R., Tunney, R. J., and McCarthy, J. D. (2002). “A re-
examination of probability matching and rational choice,” J. Behav. Dec.
Making 15, 233–250.
Shinn-Cunningham, B. G., Schickler, J., Kopco, N., and Litovsky, R. Y.
(2001). “Spatial unmasking of nearby speech sources in a simulated
anechoic environment,” J. Acoust. Soc. Am. 110, 1118–1129.
Spence, C. J., and Driver, J. (1994). “Covert spatial orienting in audition:
Exogenous and endogenous mechanisms,” J. Exp. Psychol. Hum. Percept.
Perform. 20, 555–574.
Vulkan, N. (2000). “An economist’s perspective on probability matching,” J.
Econ. Surveys 14, 101–118.
West, R. F., and Stanovich, K. E. (2003). “Is probability matching smart?
Associations between probabilistic choices and cognitive ability,” Mem.
Cognit. 31, 243–251.
Woods, D. L., Alain, C., Diaz, R., Rhodes, D., and Ogawa, K. H. (2001).
“Location and frequency cues in auditory selective attention,” J. Exp.
Psychol. Hum. Percept. Perform. 27, 65–74.
Wright, B. A., and Dai, H. (1994). “Detection of unexpected tones with
short and long durations,” J. Acoust. Soc. Am. 95, 931–938.
Wright, B. A., and Dai, H. (1998). “Detection of sinusoidal amplitude
modulation at unexpected rates,” J. Acoust. Soc. Am. 104, 2991–2997.
Yost, W. A. (1997). “The cocktail party problem: Forty years later,” in
Binaural and Spatial Hearing in Real and Virtual Environments, edited by
R. A. Gilkey and T. R. Anderson (Erlbaum, Hillsdale, NJ), pp. 329–348.
Yost, W. A., Dye, R. H., and Sheft, S. (1996). “A simulated ‘cocktail party’
with up to three sound sources,” Percept. Psychophys. 58, 1026–1036.
Zatorre, R. J. (1989). “Perceptual asymmetry on the dichotic fused words
test and cerebral speech lateralization determined by the carotid sodium
amytal test,” Neuropsychologia 27, 1207–1219.
Zurek, P. M. (1993). “Binaural advantages and directional effects in speech
intelligibility,” in Acoustical Factors Affecting Hearing Aid Performance,
edited by G. A. Studebaker and I. Hochberg (Allyn and Bacon, Boston),
pp. 255–276.

This paper describes a number of objective experiments on recognition, concerning particularly the relation between the messages received by the two ears. Rather than use steady tones or clicks (frequency or time‐point signals) continuous speech is used, and the results interpreted in the main statistically. Two types of test are reported: (a) the behavior of a listener when presented with two speech signals simultaneously (statistical filtering problem) and (b) behavior when different speech signals are presented to his two ears.