
Episodic encoding of voice attributes and recognition memory for spoken words

Authors: Thomas J. Palmeri, Stephen D. Goldinger, and David B. Pisoni

Abstract

Recognition memory for spoken words was investigated with a continuous recognition memory task. Independent variables were number of intervening words (lag) between initial and subsequent presentations of a word, total number of talkers in the stimulus set, and whether words were repeated in the same voice or a different voice. In Experiment 1, recognition judgments were based on word identity alone. Same-voice repetitions were recognized more quickly and accurately than different-voice repetitions at all values of lag and at all levels of talker variability. In Experiment 2, recognition judgments were based on both word identity and voice identity. Subjects recognized repeated voices quite accurately. Gender of the talker affected voice recognition but not item recognition. These results suggest that detailed information about a talker's voice is retained in long-term episodic memory representations of spoken words.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 1993, Vol. 19, No. 2, 309-328.
Copyright 1993 by the American Psychological Association, Inc. 0278-7393/93/$3.00
The speech signal varies substantially across individual talkers as a result of differences in the shape and length of the vocal tract (Carrell, 1984; Fant, 1973; Summerfield & Haggard, 1973), glottal source function (Carrell, 1984), positioning and control of articulators (Ladefoged, 1980), and dialect. According to most contemporary theories of speech perception, acoustic differences between talkers constitute noise that must be somehow filtered out or transformed so that the symbolic information encoded in the speech signal may be recovered (e.g., Bladon, Henton, & Pickering, 1984; Disner, 1980; Gerstman, 1968; Green, Kuhl, Meltzoff, & Stevens, 1991; Summerfield & Haggard, 1973). In these theories, some type of "talker-normalization" mechanism, either implicit or explicit, is assumed to compensate for the inherent talker variability1 in the speech signal (e.g., Joos, 1948). Although many theories attempt to describe how idealized or abstract phonetic representations are recovered from the speech signal (see Johnson, 1990, and Nearey, 1989, for reviews), little mention is made of the fate of voice information after lexical access is complete. The talker-normalization hypothesis is consistent with current views of speech perception wherein acoustic-phonetic invariances are sought, redundant surface forms are quickly forgotten, and only semantic information is retained in long-term memory (see Pisoni, Lively, & Logan, 1992).
According to the traditional view of speech perception, detailed information about a talker's voice is absent from the representations of spoken utterances in memory. In fact, evidence from a variety of tasks suggests that the surface forms of both auditory and visual stimuli are retained in memory.
Using a continuous recognition memory task (Shepard & Teghtsoonian, 1961), Craik and Kirsner (1974) found that recognition memory for spoken words was better when words were repeated in the same voice as that in which they were originally presented. The enhanced recognition of same-voice repetitions did not deteriorate over increasing delays between repetitions. Moreover, subjects were able to recognize whether a word was repeated in the same voice as in its original presentation. When words were presented visually, Kirsner (1973) found that recognition memory was better for words that were presented and repeated in the same typeface. In a parallel to the auditory data, subjects were also able to recognize whether a word was repeated in the same typeface as in its original presentation. Kirsner and Smith (1974) found similar results when the presentation modalities of words, either visual or auditory, were repeated.
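To make the paradigm concrete, the following is a minimal sketch of how a continuous recognition list with lag and voice manipulations might be assembled. The word list, talker labels, list length, and greedy scheduling strategy are all illustrative assumptions, not the design reported in any of the studies discussed here.

```python
import itertools
import random

# Illustrative parameters (hypothetical, not from the cited studies).
WORDS = ["gate", "lawn", "mice", "pearl", "shelf", "tribe"]
TALKERS = ["talker01", "talker02", "talker03", "talker04"]
LAGS = [1, 2, 4, 8]          # items intervening between the two occurrences
N_SLOTS = 40                 # total list length, padded with filler words

def build_list(words, talkers, lags, n_slots=N_SLOTS):
    slots = [None] * n_slots
    for word in words:
        lag = random.choice(lags)
        voice1 = random.choice(talkers)
        same_voice = random.random() < 0.5
        voice2 = voice1 if same_voice else random.choice(
            [t for t in talkers if t != voice1])
        # Find a first position such that both required slots are free;
        # `lag` counts the items between the two occurrences.
        open_pairs = [i for i in range(n_slots - lag - 1)
                      if slots[i] is None and slots[i + lag + 1] is None]
        if not open_pairs:
            continue  # toy scheduler: silently skip unplaceable items
        i = random.choice(open_pairs)
        slots[i] = (word, voice1, "first")
        slots[i + lag + 1] = (word, voice2,
                              "same" if same_voice else "different")
    # Pad unused slots with new filler items so the lags stay intact.
    fillers = (f"filler{k:02d}" for k in itertools.count())
    return [(next(fillers), random.choice(talkers), "new") if s is None else s
            for s in slots]

for trial in build_list(WORDS, TALKERS, LAGS):
    print(trial)
```

A real experiment would additionally counterbalance lag, voice condition, and talker across lists and subjects; this sketch shows only the core manipulation.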
Long-term memory for surface features of text has also been demonstrated in several studies by Kolers and his colleagues. Kolers and Ostry (1974) observed greater savings in reading times when subjects reread passages of inverted text that were presented in the same inverted form as an earlier presentation than when the same text was presented in a different inverted form. This savings in reading time was found even 1 year after the original presentation of the inverted text, although recognition memory for the semantic content of the passages was reduced to chance (Kolers, 1976). Together with the data from Kirsner and colleagues, these findings suggest that physical forms of auditory and visual stimuli are not filtered out during encoding but instead remain part of long-term memory representations. In the domain of spoken language processing,
Thomas J. Palmeri and David B. Pisoni, Department of Psychology, Indiana University; Stephen D. Goldinger, Department of Psychology, Arizona State University. This research was supported by National Institutes of Health Research Grant DC-00111-16 to Indiana University, Bloomington. We thank Fergus Craik, Edward Geiselman, Leah Light, Scott Lively, Lynne Nygaard, Mitch Sommers, and Richard Shiffrin for their valuable comments and criticisms. We also thank Kristin Lively for collecting data in Experiment 2. Correspondence concerning this article should be addressed to Thomas J. Palmeri or David B. Pisoni, Department of Psychology, Indiana University, Bloomington, Indiana 47405.

1 Talker variability refers to differences between talkers. All references to talker variability and voice differences throughout this article refer to such between-talker differences. Differences between words produced by the same talker are not implied by this term.
... It is difficult to overstate the influence of the series of studies examining talker-specificity effects in the 1990s (Goldinger, 1996; Luce and Lyons, 1998; Palmeri et al., 1993). In these studies, it was found that listeners were faster and more accurate at recognizing a word they had heard previously in an experiment when it was repeated in the same voice, rather than in the voice of a different talker, demonstrating that listeners stored fine-grained, talker-specific acoustic information in long-term memory. ...
... As an intermediate starting point, we conducted a continuous recognition memory experiment in Hindi, largely following the procedure described by Palmeri et al. (1993). Hindi was chosen as the language of study for several reasons. ...
... The specificity effect was only marginal in the response latency analysis. As in previous studies (Clapp et al., 2023a; Clapp et al., 2023b; Palmeri et al., 1993), participants were faster at responding at short than at long lags and faster when tokens were longer than when they were shorter. However, participants were not significantly faster to respond when a word was repeated in the SAME than in a DIFF voice, even though a latency advantage is part of the talker-specificity effect as traditionally described (Goh, 2005; Goldinger, 1996; Palmeri et al., 1993). ...
Article
The discovery that listeners more accurately identify words repeated in the same voice than in a different voice has had an enormous influence on models of representation and speech perception. Although this effect has been widely replicated in English, little is known about whether and how it generalizes across languages. In a continuous recognition memory study with Hindi speakers and listeners (N = 178), we replicated the talker-specificity effect for accuracy-based measures (hit rate and D′) and found the latency advantage to be marginal (p = 0.06). These data help us better understand talker-specificity effects cross-linguistically and highlight the importance of expanding work to less studied languages.
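For readers unfamiliar with the sensitivity measure mentioned above, here is a brief sketch of how D′ (often written d′) is commonly computed from hits and false alarms. The cell counts are invented for illustration, and the log-linear correction is one common convention, not necessarily the one used in that study.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity (d') from raw response counts.

    Uses a log-linear correction (add 0.5 to each cell) so that
    hit/false-alarm rates of exactly 0 or 1 stay finite under the
    inverse-normal transform.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Invented counts: 80 hits / 20 misses on repeated words,
# 10 false alarms / 90 correct rejections on new words.
print(round(d_prime(80, 20, 10, 90), 2))
```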
... Hence, when identifying a word, groups of episodic memory traces are activated to access the stored information about previous encounters with that word (Goldinger, 1998; for computational implementations of these ideas, see also Ans et al., 1998; Hintzman, 1986, 1988; Reid et al., 2023). Indeed, in the literature on spoken word recognition, there is evidence that surface features of voice attributes are retained in memory traces for spoken information (e.g., see Clapp et al., 2023; Palmeri et al., 1993). Notably, instance accounts can easily explain why brand names are much more sensitive to perceptual factors than common words: the memory traces of brand names like amazon would contain distinct perceptual characteristics with little variability in their perceptual traces (see Rocabado et al., 2023). ...
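As a concrete illustration of this instance-based retrieval idea, below is a toy sketch in the spirit of Hintzman's (1986) MINERVA 2 model. The feature vectors are random stand-ins for speech episodes, and the similarity normalization is simplified relative to the published model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each stored encounter leaves its own trace: a vector of features
# valued -1, 0, or +1 (0 = feature not encoded).
n_traces, n_features = 100, 50
traces = rng.choice([-1.0, 0.0, 1.0], size=(n_traces, n_features))

def echo_intensity(probe, traces):
    # Similarity: normalized dot product between probe and each trace
    # (simplified; MINERVA 2 normalizes by the nonzero features).
    similarities = traces @ probe / n_features
    # Cubing preserves sign but lets close matches dominate the echo.
    activations = similarities ** 3
    return activations.sum()

old_probe = traces[0]  # probe matching a previously stored episode
new_probe = rng.choice([-1.0, 0.0, 1.0], size=n_features)  # novel probe
print("old item echo:", echo_intensity(old_probe, traces))
print("new item echo:", echo_intensity(new_probe, traces))
```

On such an account, a same-voice repetition is a probe that matches a stored trace on both linguistic and voice features, and so produces a stronger echo than a different-voice repetition.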
... Conversely, words that appear in specific contexts and formats, such as brand names, would be functionally episodic, thus being more sensitive to perceptual effects. An advantage of the principles of the instance theory is that they hold in various areas of cognitive psychology, including associative learning, human memory, spoken word recognition (see Clapp et al., 2023; Palmeri et al., 1993), and language processing (see Jamieson et al., 2022, for a review), thus providing a highly comprehensive framework. ...
Article
Full-text available
While abstractionist theories of visual word recognition propose that perceptual elements like font and letter case are filtered out during lexical access, instance-based theories allow for the possibility that these surface details influence this process. To disentangle these accounts, we focused on brand names embedded in logotypes. The consistent visual presentation of brand names may render them much more susceptible to perceptual factors than common words. In the present study, we compared original and modified brand logos, varying in font or letter case. In Experiment 1, participants decided whether the stimuli corresponded to existing brand names or not, regardless of graphical information. In Experiment 2, participants had to categorize existing brand names semantically – whether they corresponded to a brand in the transportation sector or not. Both experiments showed longer response times for the modified brand names, regardless of font or letter-case changes. These findings challenge the notion that only abstract units drive visual word recognition. Instead, they favor those models that assume that, under some circumstances, the traces in lexical memory may contain surface perceptual information.
... Our data also shed light on the processes involved in generating predictions. Cueing the speaker's face resulted in faster RTs only for predictable words, suggesting that this effect cannot be solely attributed to the priming of talker-specific representations (Creel et al., 2008; Creel & Bregman, 2011; Goldinger, 1996; Nygaard & Pisoni, 1998; Palmeri et al., 1993; Remez et al., 1997). Rather, our result appears to be specific to prediction processes based on sentential constraints. ...
Article
Full-text available
Most models of language comprehension assume that the linguistic system is able to pre-activate phonological information. However, the evidence for phonological prediction is mixed and controversial. In this study, we implement a paradigm that capitalizes on the fact that foreign speakers usually make phonological errors. We investigate whether speaker identity (native vs. foreign) is used to make specific phonological predictions. Fifty-two participants were recruited to read sentence frames followed by a final spoken word, which was uttered by either a native or a foreign speaker. They were required to perform a lexical decision on that spoken word, which could be either semantically predictable or not. Speaker identity (native vs. foreign) was or was not cued by the face of the speaker. We observed that the face cue is effective in speeding up the lexical decision when the word is predictable, but it is not effective when the word is not predictable. This result shows that speech prediction takes into account the phonological variability between speakers, suggesting that the phonological representation of a predictable word can be pre-activated in a detailed and specific way.
... Exemplar theory proposes that episodic traces are encoded in the lexicon (Goldinger, 1998; Hintzman, 1984; Johnson, 1997, 2006; Pierrehumbert, 2001). Many researchers have suggested that nonauditory factors, such as characteristics of the speaker, are also stored with these exemplars (see work on talker familiarity effects: Craik & Kirsner, 1974; Magnuson et al., 2021; Newman & Evers, 2007; Palmeri et al., 1993). Over time, listeners may create abstracted categories, systematically linking social groupings and phonetic patterns (as proposed by Melguy & Johnson, 2021). ...
Article
Full-text available
Prior research has shown that visual information, such as a speaker’s perceived race or ethnicity, prompts listeners to expect a specific sociophonetic pattern (“social priming”). Indeed, a picture of an East Asian face may facilitate perception of second language (L2) Mandarin Chinese-accented English but interfere with perception of first language- (L1-) accented English. The present study builds on this line of inquiry, addressing the relationship between social priming effects and implicit racial/ethnic associations for L1- and L2-accented speech. For L1-accented speech, we found no priming effects when comparing White versus East Asian or Latina primes. For L2- (Mandarin Chinese-) accented speech, however, transcription accuracy was slightly better following an East Asian prime than a White prime. Across all experiments, a relationship between performance and individual differences in implicit associations emerged, but in no cases did this relationship interact with the priming manipulation. Ultimately, exploring social priming effects with additional methodological approaches, and in different populations of listeners, will help to determine whether these effects operate differently in the context of L1- and L2-accented speech.
... These views are not necessarily contradictory. Talker-specific properties may be encoded in representations, creating more associative hooks (Barcroft & Sommers, 2005, 2014; Goldinger, 1998) and aiding subsequent speech processing, as evidenced by improved identification of tokens produced by familiar talkers (e.g., Nygaard et al., 1994; Palmeri et al., 1993). Furthermore, clusters of consistent information, such as phonetic units, may naturally emerge from the encoded and stored idiosyncratic talker-related attributes. ...
Article
Talker variability has been reported to facilitate generalization and retention of speech learning, but is also shown to place demands on cognitive resources. Our recent study provided evidence that phonetically-irrelevant acoustic variability in single-talker (ST) speech is sufficient to induce equivalent amounts of learning to the use of multiple-talker (MT) training. This study is a follow-up contrasting MT versus ST training with varying degrees of temporal exaggeration to examine how cognitive measures of individual learners may influence the role of input variability in immediate learning and long-term retention. Native Chinese-speaking adults were trained on the English /i/-/ɪ/ contrast. We assessed the trainees' working memory and inhibition control before training. The two trained groups showed comparable long-term retention of training effects in terms of word identification performance and more native-like cue weighting in both perception and production regardless of talker variability condition. The results demonstrate the role of phonetically-irrelevant variability in robust speech learning and modulatory functions of nonlinguistic domain-general inhibitory control and working memory, highlighting the necessity to consider the interaction between input characteristics, task difficulty, and individual differences in cognitive abilities in assessing learning outcomes.
... Despite the prominence of talker variability in the prior literature, there is more limited evidence that some (though not all) other sources of variability affect speech perception. For instance, while listeners are less likely to recognize that they had heard a word previously if it is spoken by a different talker (Palmeri et al., 1993), they are also less likely to recognize that they had heard a word before if it is spoken by the same talker but at a different rate (Bradlow et al., 1999). Trial-by-trial variability in speech rate is similarly deleterious for on-line speech identification accuracy (Sommers and Barcroft, 2006; Uchanski et al., 1992) and speed (Newman et al., 2001). ...
Article
Phonetic variability across talkers imposes additional processing costs during speech perception, evident in performance decrements when listening to speech from multiple talkers. However, within-talker phonetic variation is a less well-understood source of variability in speech, and it is unknown how processing costs from within-talker variation compare to those from between-talker variation. Here, listeners performed a speeded word identification task in which three dimensions of variability were factorially manipulated: between-talker variability (single vs multiple talkers), within-talker variability (single vs multiple acoustically distinct recordings per word), and word-choice variability (two- vs six-word choices). All three sources of variability led to reduced speech processing efficiency. Between-talker variability affected both word-identification accuracy and response time, but within-talker variability affected only response time. Furthermore, between-talker variability, but not within-talker variability, had a greater impact when the target phonological contrasts were more similar. Together, these results suggest that natural between- and within-talker variability reflect two distinct magnitudes of common acoustic–phonetic variability: Both affect speech processing efficiency, but they appear to have qualitatively and quantitatively unique effects due to differences in their potential to obscure acoustic–phonemic correspondences across utterances.
Preprint
Full-text available
Memorability, the likelihood that a stimulus is remembered, is an intrinsic stimulus property that is highly consistent across people—participants tend to remember and forget the same faces, objects, and more. However, these consistencies in memory have thus far only been observed for visual stimuli. We provide the first study of auditory memorability, collecting recognition memory scores from over 3000 participants listening to a sequence of different speakers saying the same sentence. We found significant consistency across participants in their memory for voice clips and for speakers across different utterances. Next, we tested regression models incorporating both low-level (e.g., fundamental frequency) and high-level (e.g., dialect) voice properties to predict their memorability. These models were significantly predictive, and cross-validated out-of-sample, supporting an inherent memorability of speakers’ voices. These results provide the first evidence that listeners are similar in the voices they remember, which can be reliably predicted by quantifiable voice features.
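A cross-validated regression of the kind this preprint describes might look like the following sketch. The feature set, model, and data are synthetic placeholders standing in for voice properties such as fundamental frequency or coded dialect, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-ins: 200 speakers, 5 voice features (e.g., mean f0,
# f0 range, duration, plus coded high-level properties like dialect).
X = rng.normal(size=(200, 5))
true_weights = rng.normal(size=5)
y = X @ true_weights + rng.normal(scale=0.5, size=200)  # memorability scores

# Out-of-sample predictivity via 5-fold cross-validation.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"mean out-of-sample R^2: {scores.mean():.2f}")
```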
Article
Full-text available
Listeners readily adapt to variation in non-native-accented speech, learning to disambiguate between talker-specific and accent-based variation. We asked (1) which linguistic and indexical features of the spoken utterance are relevant for this learning to occur and (2) whether task-driven attention to these features affects the extent to which learning generalizes to novel utterances and voices. In two experiments, listeners heard English sentences (Experiment 1) or words (Experiment 2) produced by Spanish-accented talkers during an exposure phase. Listeners' attention was directed to lexical content (transcription), indexical cues (talker identification), or both (transcription + talker identification). In Experiment 1, listeners' test transcription of novel English sentences spoken by Spanish-accented talkers showed generalized perceptual learning to previously unheard voices and utterances for all training conditions. In Experiment 2, generalized learning occurred only in the transcription + talker identification condition, suggesting that attention to both linguistic and indexical cues optimizes listeners’ ability to distinguish between individual talker- and group-based variation, especially with the reduced availability of sentence-length prosodic information. Collectively, these findings highlight the role of attentional processes in the encoding of speech input and underscore the interdependency of indexical and lexical characteristics in spoken language processing.
Article
Listeners use more than just acoustic information when processing speech. Social information, such as a speaker’s perceived race or ethnicity, can also affect the processing of the speech signal, in some cases facilitating perception (“social priming”). We aimed to replicate and extend this line of inquiry, examining effects of multiple social primes (i.e., a Middle Eastern, White, or East Asian face, or a control silhouette image) on the perception of Mandarin Chinese-accented English and Arabic-accented English. By including uncommon priming combinations (e.g., a Middle Eastern prime for a Mandarin accent), we aimed to test the specificity of social primes: For example, can a Middle Eastern face facilitate perception of both Arabic-accented English and Mandarin-accented English? Contrary to our predictions, our results indicated no facilitative social priming effects for either of the second language (L2) accents. Results for our examination of specificity were mixed. Trends in the data indicated that the combination of an East Asian prime with Arabic accent resulted in lower accuracy as compared with a White prime, but the combination of a Middle Eastern prime with a Mandarin accent did not (and may have actually benefited listeners to some degree). We conclude that the specificity of priming effects may depend on listeners’ level of familiarity with a given accent and/or racial/ethnic group and that the mixed outcomes in the current work motivate further inquiries to determine whether social priming effects for L2-accented speech may be smaller than previously hypothesized and/or highly dependent on listener experience.
Article
Previous studies have suggested the effect of linguistic information on voice perception (e.g., the language-familiarity effect [LFE]). However, it remains unclear which type of specific information in speech contributes to voice perception, including acoustic, phonological, lexical, and semantic information. It is also underexamined whether the roles of these different types of information are modulated by the experimental paradigm (speaker discrimination vs. speaker identification). In this study, we conducted two experiments to investigate these issues regarding LFEs. Experiment 1 examined the roles of acoustic and phonological information in speaker discrimination and identification with forward and time-reversed Mandarin and Indonesian sentences. Experiment 2 further identified the roles of phonological, lexical, and semantic information with forward, word-scrambled, and reconstructed (consisting of pseudo-Mandarin words) Mandarin and forward Indonesian sentences. For Mandarin-only participants, in Experiment 1, speaker discrimination was more accurate for forward than reversed sentences, but there was no LFE in either sentence. Speaker identification was also more accurate for forward than reversed sentences, whereas there was an LFE for forward sentences. In Experiment 2, speaker discrimination was better for word-scrambled than reconstructed Mandarin sentences. Speaker identification was more accurate for forward and word-scrambled Mandarin sentences but less accurate for Mandarin reconstructed and forward Indonesian sentences. In general, the pattern of the results for Indonesian learners was the same as that for Mandarin-only speakers. These results suggest that different kinds of information support speaker discrimination and identification in native and unfamiliar languages. The LFE in speaker identification depends on both phonological and lexical information.
Article
Full-text available
The accuracy and response latency of yes/no recognition decisions were measured in 3 experiments by the continuous recognition paradigm. Ss were 12 right-handed undergraduates. Lag (the number of intervening items between target presentations) was varied from 0 to 40. A logarithmic function provided a good description of the relation between lag and correct response latency. Item repetition affected the intercept of the logarithmic functions, with little effect on the slope. A noun/nonnoun stimulus manipulation affected the slope of the functions with no appreciable effect on the intercept. The latter result was obtained both for once- and twice-repeated item functions, and when the stimulus manipulation was both a between- and a within-lists variable. Results are incompatible with R. C. Atkinson and J. F. Juola's (1973) model and B. B. Murdock's (1974) conveyor-belt model. The retrieval theory of R. Ratcliff (1978) and the multiple-observations model of R. Pike et al. (1977) provide the most satisfactory account of the present results. However, both models may have difficulty in accounting for the obtained repetition effects. (25 ref)
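The logarithmic lag function can be made concrete with a small fitting sketch. The latencies below are invented for illustration, and the lag-plus-one offset is an assumption made so that lag 0 is defined, not a detail taken from the cited study.

```python
import numpy as np

# Invented lag/latency pairs (ms), loosely shaped like the reported
# pattern: latency grows roughly linearly in log(lag).
lags = np.array([0, 1, 4, 10, 20, 40])
latency = np.array([620, 655, 700, 730, 755, 780])

# Fit latency = a + b * ln(lag + 1) by ordinary least squares.
x = np.log(lags + 1)
slope, intercept = np.polyfit(x, latency, 1)
print(f"intercept a = {intercept:.0f} ms, slope b = {slope:.0f} ms/log unit")
```

In these terms, the abstract's claim is that item repetition shifts the intercept a, whereas the noun/nonnoun manipulation changes the slope b.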
Article
Full-text available
Forty Ss were given a continuous recognition memory test in which each word was presented twice, either in the same print or in different print on the two occasions. The results showed that (a) recognition performance was facilitated to a small but statistically significant extent in the same-print condition and that (b) Ss could reliably report first-presentation print for recognized items for at least 1½ min. In a second experiment, the stimuli used were nonsense strings of from five to seven letters instead of words. This manipulation increased the same-print advantage in recognition but reduced Ss' ability to report first print form. The results indicate that information about the physical features of verbal stimuli is retained in a visual code that is partially or wholly independent of the verbal code for the same stimuli. The results are inconsistent with the conclusion that the visual code is stored only as a dependent attribute of the verbal code in memory.
Article
Full-text available
An experiment was designed to investigate the locus of persistence of information about presentation modality for verbal stimuli. Twenty-four Ss were presented with a continuous series of 672 letter sequences for word/nonword categorization. The sequences were divided equally between words and nonwords, and each item was presented twice in the series, either in the same or in a different modality. Repetition facilitation, the advantage resulting from a second presentation, was greatest in the intramodality conditions for both words (+ve responses) and nonwords (-ve responses). Facilitation in these conditions declined from 170 msec at Lag 0 (4 sec) to approximately 40 msec at Lag 63. Facilitation was reduced in the cross-modality condition for words and was absent from the cross-modality condition for nonwords. The modality-specific component of the repetition effect found in the word/nonword categorization paradigm may be attributed to persistence in the nonlexical, as distinct from lexical, component of the word categorization process.
Article
Full-text available
Geiselman and Bellezza (1976) concluded that any retention in memory of the sex of a speaker of verbal material is automatic. Two possible reasons for this were hypothesized: the voice-connotation hypothesis and the dual-hemisphere parallel-processing hypothesis. In Experiment 1, the to-be-remembered sentences contained either male or female agents. Incidental retention of sex of speaker did not occur. This result does not support the dual-hemisphere parallel-processing hypothesis, which indicates that retention of voice should be independent of sentence content. In Experiment 2, the sentences contained neutral agents, and incidental retention of sex of speaker did occur. The results of Experiments 1 and 2 support the connotation hypothesis. The different results with regard to incidental retention of the speaker's voice found in Experiments 1 and 2 were replicated in Experiment 3 using a within-subjects design. Experiment 4 was conducted to determine if a speaker's voice does, in fact, influence the meaning of a neutral sentence. In agreement with the voice-connotation hypothesis, sentences spoken by a male were rated as having more "potent" connotations than sentences spoken by a female.
Chapter
Variations in vocal tract size between speakers are reflected in the acoustic characteristics of their speech but are largely normalised out in perception. Can such normalisation be measured as an additional stage in speech perception?
Article
The development and evaluation of a new speech‐intelligibility test suitable for routine use by operational personnel in determining the performance level of speech‐communication systems is described. The format used is similar to that described for a rhyme test but makes use of a closed‐response set. An experiment was performed to determine the general reliability of the test materials when administered to U.S. Air Force enlisted personnel under a wide range of signal‐to‐noise ratios. Testing of 18 listeners over a period of 30 days showed that repeated exposure to the materials did not change the levels of average response in any appreciable way. Analysis of the responses to individual phonetic elements shows that the test can be useful for diagnostic study as well as for over‐all evaluation of communication systems. Talker differences that appeared during the experiment and the statistical reliability and sensitivity of the materials are analyzed and discussed.
Article
Two experiments demonstrated that Ss are capable of making within-modality memory discriminations in both visual and auditory modalities. In Experiment I, Ss studied mixed lists of pictures and labels representing common objects and were subsequently required to judge whether the original presentation was pictorial or verbal. The high level of performance achieved on this task was unaffected by degree of categorical relatedness of items within method of presentation or by instructions to produce visual images when items were presented verbally. In Experiment II, Ss demonstrated the ability to remember whether a sentence was originally presented by a male or a female speaker. Some strategies by which within-modality discrimination in memory might be accomplished are discussed.