
Episodic encoding of voice attributes and recognition memory for spoken words

Authors: Thomas J. Palmeri, Stephen D. Goldinger, and David B. Pisoni

Abstract

Recognition memory for spoken words was investigated with a continuous recognition memory task. Independent variables were number of intervening words (lag) between initial and subsequent presentations of a word, total number of talkers in the stimulus set, and whether words were repeated in the same voice or a different voice. In Experiment 1, recognition judgments were based on word identity alone. Same-voice repetitions were recognized more quickly and accurately than different-voice repetitions at all values of lag and at all levels of talker variability. In Experiment 2, recognition judgments were based on both word identity and voice identity. Subjects recognized repeated voices quite accurately. Gender of the talker affected voice recognition but not item recognition. These results suggest that detailed information about a talker's voice is retained in long-term episodic memory representations of spoken words.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 1993, Vol. 19, No. 2, 309-328.
Copyright 1993 by the American Psychological Association, Inc. 0278-7393/93/$3.00
The speech signal varies substantially across individual talkers as a result of differences in the shape and length of the vocal tract (Carrell, 1984; Fant, 1973; Summerfield & Haggard, 1973), glottal source function (Carrell, 1984), positioning and control of articulators (Ladefoged, 1980), and dialect. According to most contemporary theories of speech perception, acoustic differences between talkers constitute noise that must be somehow filtered out or transformed so that the symbolic information encoded in the speech signal may be recovered (e.g., Bladon, Henton, & Pickering, 1984; Disner, 1980; Gerstman, 1968; Green, Kuhl, Meltzoff, & Stevens, 1991; Summerfield & Haggard, 1973). In these theories, some type of "talker-normalization" mechanism, either implicit or explicit, is assumed to compensate for the inherent talker variability1 in the speech signal (e.g., Joos, 1948). Although many theories attempt to describe how idealized or abstract phonetic representations are recovered from the speech signal (see Johnson, 1990, and Nearey, 1989, for reviews), little mention is made of the fate of voice information after lexical access is complete. The talker-normalization hypothesis is consistent with current views of speech perception wherein acoustic-phonetic invariances are sought, redundant surface forms are quickly forgotten, and only semantic information is retained in long-term memory (see Pisoni, Lively, & Logan, 1992).
According to the traditional view of speech perception, detailed information about a talker's voice is absent from the representations of spoken utterances in memory. In fact, evidence from a variety of tasks suggests that the surface forms of both auditory and visual stimuli are retained in memory.
Using a continuous recognition memory task (Shepard & Teghtsoonian, 1961), Craik and Kirsner (1974) found that recognition memory for spoken words was better when words were repeated in the same voice as that in which they were originally presented. The enhanced recognition of same-voice repetitions did not deteriorate over increasing delays between repetitions. Moreover, subjects were able to recognize whether a word was repeated in the same voice as in its original presentation. When words were presented visually, Kirsner (1973) found that recognition memory was better for words that were presented and repeated in the same typeface. In a parallel to the auditory data, subjects were also able to recognize whether a word was repeated in the same typeface as in its original presentation. Kirsner and Smith (1974) found similar results when the presentation modalities of words, either visual or auditory, were repeated.
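To make the paradigm concrete, the following is a minimal sketch of how a continuous recognition list with lag and voice manipulations might be assembled. The word list, talker labels, list length, and greedy scheduling strategy are all illustrative assumptions, not the design reported in any of the studies discussed here.

```python
import itertools
import random

# Illustrative parameters (hypothetical, not from the cited studies).
WORDS = ["gate", "lawn", "mice", "pearl", "shelf", "tribe"]
TALKERS = ["talker01", "talker02", "talker03", "talker04"]
LAGS = [1, 2, 4, 8]          # items intervening between the two occurrences
N_SLOTS = 40                 # total list length, padded with filler words

def build_list(words, talkers, lags, n_slots=N_SLOTS):
    slots = [None] * n_slots
    for word in words:
        lag = random.choice(lags)
        voice1 = random.choice(talkers)
        same_voice = random.random() < 0.5
        voice2 = voice1 if same_voice else random.choice(
            [t for t in talkers if t != voice1])
        # Find a first position such that both required slots are free;
        # `lag` counts the items between the two occurrences.
        open_pairs = [i for i in range(n_slots - lag - 1)
                      if slots[i] is None and slots[i + lag + 1] is None]
        if not open_pairs:
            continue  # toy scheduler: silently skip unplaceable items
        i = random.choice(open_pairs)
        slots[i] = (word, voice1, "first")
        slots[i + lag + 1] = (word, voice2,
                              "same" if same_voice else "different")
    # Pad unused slots with new filler items so the lags stay intact.
    fillers = (f"filler{k:02d}" for k in itertools.count())
    return [(next(fillers), random.choice(talkers), "new") if s is None else s
            for s in slots]

for trial in build_list(WORDS, TALKERS, LAGS):
    print(trial)
```

A real experiment would additionally counterbalance lag, voice condition, and talker across lists and subjects; this sketch shows only the core manipulation.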
Long-term memory for surface features of text has also been demonstrated in several studies by Kolers and his colleagues. Kolers and Ostry (1974) observed greater savings in reading times when subjects reread passages of inverted text that were presented in the same inverted form as an earlier presentation than when the same text was presented in a different inverted form. This savings in reading time was found even 1 year after the original presentation of the inverted text, although recognition memory for the semantic content of the passages was reduced to chance (Kolers, 1976). Together with the data from Kirsner and colleagues, these findings suggest that physical forms of auditory and visual stimuli are not filtered out during encoding but instead remain part of long-term memory representations. In the domain of spoken language processing,
Thomas J. Palmeri and David B. Pisoni, Department of Psychology, Indiana University; Stephen D. Goldinger, Department of Psychology, Arizona State University. This research was supported by National Institutes of Health Research Grant DC-00111-16 to Indiana University, Bloomington. We thank Fergus Craik, Edward Geiselman, Leah Light, Scott Lively, Lynne Nygaard, Mitch Sommers, and Richard Shiffrin for their valuable comments and criticisms. We also thank Kristin Lively for collecting data in Experiment 2. Correspondence concerning this article should be addressed to Thomas J. Palmeri or David B. Pisoni, Department of Psychology, Indiana University, Bloomington, Indiana 47405.

1 Talker variability refers to differences between talkers. All references to talker variability and voice differences throughout this article refer to such between-talker differences. Differences between words produced by the same talker are not implied by this term.
... It is difficult to overstate the influence of the series of studies examining talker-specificity effects in the 1990s (Goldinger, 1996; Luce and Lyons, 1998; Palmeri et al., 1993). In these studies, it was found that listeners were faster and more accurate at recognizing a word they had heard previously in an experiment when it was repeated in the same voice, rather than in the voice of a different talker, demonstrating that listeners stored fine-grained, talker-specific acoustic information in long-term memory. ...
... As an intermediate starting point, we conducted a continuous recognition memory experiment in Hindi, largely following the procedure described by Palmeri et al. (1993). Hindi was chosen as the language of study for several reasons. ...
... The specificity effect was only marginal in the response latency analysis. As in previous studies (Clapp et al., 2023a; Clapp et al., 2023b; Palmeri et al., 1993), participants were faster at responding at short than at long lags and faster when tokens were longer than when they were shorter. However, participants were not significantly faster to respond when a word was repeated in the SAME than in a DIFF voice, even though a latency advantage is part of the talker-specificity effect as traditionally described (Goh, 2005; Goldinger, 1996; Palmeri et al., 1993). ...
Article
The discovery that listeners more accurately identify words repeated in the same voice than in a different voice has had an enormous influence on models of representation and speech perception. Although this effect has been widely replicated in English, little is known about whether and how it generalizes across languages. In a continuous recognition memory study with Hindi speakers and listeners (N = 178), we replicated the talker-specificity effect for accuracy-based measures (hit rate and D′) and found the latency advantage to be marginal (p = 0.06). These data help us better understand talker-specificity effects cross-linguistically and highlight the importance of expanding work to less studied languages.
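For readers unfamiliar with the sensitivity measure mentioned above, here is a brief sketch of how D′ (often written d′) is commonly computed from hits and false alarms. The cell counts are invented for illustration, and the log-linear correction is one common convention, not necessarily the one used in that study.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity (d') from raw response counts.

    Uses a log-linear correction (add 0.5 to each cell) so that
    hit/false-alarm rates of exactly 0 or 1 stay finite under the
    inverse-normal transform.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Invented counts: 80 hits / 20 misses on repeated words,
# 10 false alarms / 90 correct rejections on new words.
print(round(d_prime(80, 20, 10, 90), 2))
```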
... Hence, when identifying a word, groups of episodic memory traces are activated to access the stored information about previous encounters with that word (Goldinger, 1998; for computational implementations of these ideas, see also Ans et al., 1998; Hintzman, 1986, 1988; Reid et al., 2023). Indeed, in the literature on spoken word recognition, there is evidence that surface features of voice attributes are retained in memory traces for spoken information (e.g., see Clapp et al., 2023; Palmeri et al., 1993). Notably, instance accounts can easily explain why brand names are much more sensitive to perceptual factors than common words: the memory traces of brand names like amazon would contain distinct perceptual characteristics with little variability in their perceptual traces (see Rocabado et al., 2023). ...
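As a concrete illustration of this instance-based retrieval idea, below is a toy sketch in the spirit of Hintzman's (1986) MINERVA 2 model. The feature vectors are random stand-ins for speech episodes, and the similarity normalization is simplified relative to the published model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each stored encounter leaves its own trace: a vector of features
# valued -1, 0, or +1 (0 = feature not encoded).
n_traces, n_features = 100, 50
traces = rng.choice([-1.0, 0.0, 1.0], size=(n_traces, n_features))

def echo_intensity(probe, traces):
    # Similarity: normalized dot product between probe and each trace
    # (simplified; MINERVA 2 normalizes by the nonzero features).
    similarities = traces @ probe / n_features
    # Cubing preserves sign but lets close matches dominate the echo.
    activations = similarities ** 3
    return activations.sum()

old_probe = traces[0]  # probe matching a previously stored episode
new_probe = rng.choice([-1.0, 0.0, 1.0], size=n_features)  # novel probe
print("old item echo:", echo_intensity(old_probe, traces))
print("new item echo:", echo_intensity(new_probe, traces))
```

On such an account, a same-voice repetition is a probe that matches a stored trace on both linguistic and voice features, and so produces a stronger echo than a different-voice repetition.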
... Conversely, words that appear in specific contexts and formats, such as brand names, would be functionally episodic, thus being more sensitive to perceptual effects. An advantage of the principles of the instance theory is that they hold in various areas of cognitive psychology, including associative learning, human memory, spoken word recognition (see Clapp et al., 2023; Palmeri et al., 1993), and language processing (see Jamieson et al., 2022, for a review), thus providing a highly comprehensive framework. ...
Article
Full-text available
While abstractionist theories of visual word recognition propose that perceptual elements like font and letter case are filtered out during lexical access, instance-based theories allow for the possibility that these surface details influence this process. To disentangle these accounts, we focused on brand names embedded in logotypes. The consistent visual presentation of brand names may render them much more susceptible to perceptual factors than common words. In the present study, we compared original and modified brand logos, varying in font or letter case. In Experiment 1, participants decided whether the stimuli corresponded to existing brand names or not, regardless of graphical information. In Experiment 2, participants had to categorize existing brand names semantically – whether they corresponded to a brand in the transportation sector or not. Both experiments showed longer response times for the modified brand names, regardless of font or letter-case changes. These findings challenge the notion that only abstract units drive visual word recognition. Instead, they favor those models that assume that, under some circumstances, the traces in lexical memory may contain surface perceptual information.
... Our data also shed light on the processes involved in generating predictions. Cueing the speaker's face resulted in faster RTs only for predictable words, suggesting that this effect cannot be solely attributed to the priming of talker-specific representations (Creel et al., 2008; Creel & Bregman, 2011; Goldinger, 1996; Nygaard & Pisoni, 1998; Palmeri et al., 1993; Remez et al., 1997). Rather, our result appears to be specific to prediction processes based on sentential constraints. ...
Article
Full-text available
Most models of language comprehension assume that the linguistic system is able to pre-activate phonological information. However, the evidence for phonological prediction is mixed and controversial. In this study, we implement a paradigm that capitalizes on the fact that foreign speakers usually make phonological errors. We investigate whether speaker identity (native vs. foreign) is used to make specific phonological predictions. Fifty-two participants were recruited to read sentence frames followed by a final spoken word, which was uttered by either a native or a foreign speaker. They were required to perform a lexical decision on that spoken word, which could be either semantically predictable or not. Speaker identity (native vs. foreign) was or was not cued by the face of the speaker. We observed that the face cue is effective in speeding up the lexical decision when the word is predictable, but it is not effective when the word is not predictable. This result shows that speech prediction takes into account the phonological variability between speakers, suggesting that the phonological representation of a predictable word can be pre-activated in a detailed and specific way.
... Exemplar theory proposes that episodic traces are encoded in the lexicon (Goldinger, 1998; Hintzman, 1984; Johnson, 1997, 2006; Pierrehumbert, 2001). Many researchers have suggested that nonauditory factors, such as characteristics of the speaker, are also stored with these exemplars (see work on talker familiarity effects: Craik & Kirsner, 1974; Magnuson et al., 2021; Newman & Evers, 2007; Palmeri et al., 1993). Over time, listeners may create abstracted categories, systematically linking social groupings and phonetic patterns (as proposed by Melguy & Johnson, 2021). ...
Article
Full-text available
Prior research has shown that visual information, such as a speaker’s perceived race or ethnicity, prompts listeners to expect a specific sociophonetic pattern (“social priming”). Indeed, a picture of an East Asian face may facilitate perception of second language (L2) Mandarin Chinese-accented English but interfere with perception of first language- (L1-) accented English. The present study builds on this line of inquiry, addressing the relationship between social priming effects and implicit racial/ethnic associations for L1- and L2-accented speech. For L1-accented speech, we found no priming effects when comparing White versus East Asian or Latina primes. For L2- (Mandarin Chinese-) accented speech, however, transcription accuracy was slightly better following an East Asian prime than a White prime. Across all experiments, a relationship between performance and individual differences in implicit associations emerged, but in no cases did this relationship interact with the priming manipulation. Ultimately, exploring social priming effects with additional methodological approaches, and in different populations of listeners, will help to determine whether these effects operate differently in the context of L1- and L2-accented speech.
... These views are not necessarily contradictory. Talker-specific properties may be encoded in representations, creating more associative hooks (Barcroft & Sommers, 2005, 2014; Goldinger, 1998) and aiding subsequent speech processing, as evidenced by improved identification of tokens produced by familiar talkers (e.g., Nygaard et al., 1994; Palmeri et al., 1993). Furthermore, clusters of consistent information, such as phonetic units, may naturally emerge from the encoded and stored idiosyncratic talker-related attributes. ...
Article
Talker variability has been reported to facilitate generalization and retention of speech learning, but is also shown to place demands on cognitive resources. Our recent study provided evidence that phonetically-irrelevant acoustic variability in single-talker (ST) speech is sufficient to induce equivalent amounts of learning to the use of multiple-talker (MT) training. This study is a follow-up contrasting MT versus ST training with varying degrees of temporal exaggeration to examine how cognitive measures of individual learners may influence the role of input variability in immediate learning and long-term retention. Native Chinese-speaking adults were trained on the English /i/-/ɪ/ contrast. We assessed the trainees' working memory and inhibition control before training. The two trained groups showed comparable long-term retention of training effects in terms of word identification performance and more native-like cue weighting in both perception and production regardless of talker variability condition. The results demonstrate the role of phonetically-irrelevant variability in robust speech learning and modulatory functions of nonlinguistic domain-general inhibitory control and working memory, highlighting the necessity to consider the interaction between input characteristics, task difficulty, and individual differences in cognitive abilities in assessing learning outcomes.
... Despite the prominence of talker variability in the prior literature, there is more limited evidence that some (though not all) other sources of variability affect speech perception. For instance, while listeners are less likely to recognize that they had heard a word previously if it is spoken by a different talker (Palmeri et al., 1993), they are also less likely to recognize that they had heard a word before if it is spoken by the same talker but at a different rate (Bradlow et al., 1999). Trial-by-trial variability in speech rate is similarly deleterious for on-line speech identification accuracy (Sommers and Barcroft, 2006; Uchanski et al., 1992) and speed (Newman et al., 2001). ...
Article
Phonetic variability across talkers imposes additional processing costs during speech perception, evident in performance decrements when listening to speech from multiple talkers. However, within-talker phonetic variation is a less well-understood source of variability in speech, and it is unknown how processing costs from within-talker variation compare to those from between-talker variation. Here, listeners performed a speeded word identification task in which three dimensions of variability were factorially manipulated: between-talker variability (single vs multiple talkers), within-talker variability (single vs multiple acoustically distinct recordings per word), and word-choice variability (two- vs six-word choices). All three sources of variability led to reduced speech processing efficiency. Between-talker variability affected both word-identification accuracy and response time, but within-talker variability affected only response time. Furthermore, between-talker variability, but not within-talker variability, had a greater impact when the target phonological contrasts were more similar. Together, these results suggest that natural between- and within-talker variability reflect two distinct magnitudes of common acoustic–phonetic variability: Both affect speech processing efficiency, but they appear to have qualitatively and quantitatively unique effects due to differences in their potential to obscure acoustic–phonemic correspondences across utterances.
Preprint
Full-text available
Memorability, the likelihood that a stimulus is remembered, is an intrinsic stimulus property that is highly consistent across people—participants tend to remember and forget the same faces, objects, and more. However, these consistencies in memory have thus far only been observed for visual stimuli. We provide the first study of auditory memorability, collecting recognition memory scores from over 3000 participants listening to a sequence of different speakers saying the same sentence. We found significant consistency across participants in their memory for voice clips and for speakers across different utterances. Next, we tested regression models incorporating both low-level (e.g., fundamental frequency) and high-level (e.g., dialect) voice properties to predict their memorability. These models were significantly predictive, and cross-validated out-of-sample, supporting an inherent memorability of speakers’ voices. These results provide the first evidence that listeners are similar in the voices they remember, which can be reliably predicted by quantifiable voice features.
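A cross-validated regression of the kind this preprint describes might look like the following sketch. The feature set, model, and data are synthetic placeholders standing in for voice properties such as fundamental frequency or coded dialect, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-ins: 200 speakers, 5 voice features (e.g., mean f0,
# f0 range, duration, plus coded high-level properties like dialect).
X = rng.normal(size=(200, 5))
true_weights = rng.normal(size=5)
y = X @ true_weights + rng.normal(scale=0.5, size=200)  # memorability scores

# Out-of-sample predictivity via 5-fold cross-validation.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"mean out-of-sample R^2: {scores.mean():.2f}")
```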
Article
Full-text available
Listeners readily adapt to variation in non-native-accented speech, learning to disambiguate between talker-specific and accent-based variation. We asked (1) which linguistic and indexical features of the spoken utterance are relevant for this learning to occur and (2) whether task-driven attention to these features affects the extent to which learning generalizes to novel utterances and voices. In two experiments, listeners heard English sentences (Experiment 1) or words (Experiment 2) produced by Spanish-accented talkers during an exposure phase. Listeners' attention was directed to lexical content (transcription), indexical cues (talker identification), or both (transcription + talker identification). In Experiment 1, listeners' test transcription of novel English sentences spoken by Spanish-accented talkers showed generalized perceptual learning to previously unheard voices and utterances for all training conditions. In Experiment 2, generalized learning occurred only in the transcription + talker identification condition, suggesting that attention to both linguistic and indexical cues optimizes listeners’ ability to distinguish between individual talker- and group-based variation, especially with the reduced availability of sentence-length prosodic information. Collectively, these findings highlight the role of attentional processes in the encoding of speech input and underscore the interdependency of indexical and lexical characteristics in spoken language processing.
Article
Listeners use more than just acoustic information when processing speech. Social information, such as a speaker’s perceived race or ethnicity, can also affect the processing of the speech signal, in some cases facilitating perception (“social priming”). We aimed to replicate and extend this line of inquiry, examining effects of multiple social primes (i.e., a Middle Eastern, White, or East Asian face, or a control silhouette image) on the perception of Mandarin Chinese-accented English and Arabic-accented English. By including uncommon priming combinations (e.g., a Middle Eastern prime for a Mandarin accent), we aimed to test the specificity of social primes: For example, can a Middle Eastern face facilitate perception of both Arabic-accented English and Mandarin-accented English? Contrary to our predictions, our results indicated no facilitative social priming effects for either of the second language (L2) accents. Results for our examination of specificity were mixed. Trends in the data indicated that the combination of an East Asian prime with Arabic accent resulted in lower accuracy as compared with a White prime, but the combination of a Middle Eastern prime with a Mandarin accent did not (and may have actually benefited listeners to some degree). We conclude that the specificity of priming effects may depend on listeners’ level of familiarity with a given accent and/or racial/ethnic group and that the mixed outcomes in the current work motivate further inquiries to determine whether social priming effects for L2-accented speech may be smaller than previously hypothesized and/or highly dependent on listener experience.
Article
Previous studies have suggested the effect of linguistic information on voice perception (e.g., the language-familiarity effect [LFE]). However, it remains unclear which type of specific information in speech contributes to voice perception, including acoustic, phonological, lexical, and semantic information. It is also underexamined whether the roles of these different types of information are modulated by the experimental paradigm (speaker discrimination vs. speaker identification). In this study, we conducted two experiments to investigate these issues regarding LFEs. Experiment 1 examined the roles of acoustic and phonological information in speaker discrimination and identification with forward and time-reversed Mandarin and Indonesian sentences. Experiment 2 further identified the roles of phonological, lexical, and semantic information with forward, word-scrambled, and reconstructed (consisting of pseudo-Mandarin words) Mandarin and forward Indonesian sentences. For Mandarin-only participants, in Experiment 1, speaker discrimination was more accurate for forward than reversed sentences, but there was no LFE in either sentence. Speaker identification was also more accurate for forward than reversed sentences, whereas there was an LFE for forward sentences. In Experiment 2, speaker discrimination was better for word-scrambled than reconstructed Mandarin sentences. Speaker identification was more accurate for forward and word-scrambled Mandarin sentences but less accurate for Mandarin reconstructed and forward Indonesian sentences. In general, the pattern of the results for Indonesian learners was the same as that for Mandarin-only speakers. These results suggest that different kinds of information support speaker discrimination and identification in native and unfamiliar languages. The LFE in speaker identification depends on both phonological and lexical information.
Article
Full-text available
The accuracy and response latency of yes/no recognition decisions were measured in 3 experiments by the continuous recognition paradigm. Ss were 12 right-handed undergraduates. Lag (the number of intervening items between target presentations) was varied from 0 to 40. A logarithmic function provided a good description of the relation between lag and correct response latency. Item repetition affected the intercept of the logarithmic functions, with little effect on the slope. A noun/nonnoun stimulus manipulation affected the slope of the functions with no appreciable effect on the intercept. The latter result was obtained both for once- and twice-repeated item functions, and when the stimulus manipulation was both a between- and a within-lists variable. Results are incompatible with R. C. Atkinson and J. F. Juola's (1973) model and B. B. Murdock's (1974) conveyor-belt model. The retrieval theory of R. Ratcliff (1978) and the multiple-observations model of R. Pike et al. (1977) provide the most satisfactory account of the present results. However, both models may have difficulty in accounting for the obtained repetition effects. (25 ref)
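The logarithmic lag function can be made concrete with a small fitting sketch. The latencies below are invented for illustration, and the lag-plus-one offset is an assumption made so that lag 0 is defined, not a detail taken from the cited study.

```python
import numpy as np

# Invented lag/latency pairs (ms), loosely shaped like the reported
# pattern: latency grows roughly linearly in log(lag).
lags = np.array([0, 1, 4, 10, 20, 40])
latency = np.array([620, 655, 700, 730, 755, 780])

# Fit latency = a + b * ln(lag + 1) by ordinary least squares.
x = np.log(lags + 1)
slope, intercept = np.polyfit(x, latency, 1)
print(f"intercept a = {intercept:.0f} ms, slope b = {slope:.0f} ms/log unit")
```

In these terms, the abstract's claim is that item repetition shifts the intercept a, whereas the noun/nonnoun manipulation changes the slope b.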
Article
Full-text available
Forty Ss were given a continuous recognition memory test in which each word was presented twice, either in the same print or in different print on the two occasions. The results showed that (a) recognition performance was facilitated to a small but statistically significant extent in the same-print condition and that (b) Ss could reliably report first-presentation print for recognized items for at least 1½ min. In a second experiment, the stimuli used were nonsense strings of from five to seven letters instead of words. This manipulation increased the same-print advantage in recognition but reduced Ss' ability to report first print form. The results indicate that information about the physical features of verbal stimuli is retained in a visual code that is partially or wholly independent of the verbal code for the same stimuli. The results are inconsistent with the conclusion that the visual code is stored only as a dependent attribute of the verbal code in memory.
Article
Full-text available
An experiment was designed to investigate the locus of persistence of information about presentation modality for verbal stimuli. Twenty-four Ss were presented with a continuous series of 672 letter sequences for word/nonword categorization. The sequences were divided equally between words and nonwords, and each item was presented twice in the series, either in the same or in a different modality. Repetition facilitation, the advantage resulting from a second presentation, was greatest in the intramodality conditions for both words (+ve responses) and nonwords (-ve responses). Facilitation in these conditions declined from 170 msec at Lag 0 (4 sec) to approximately 40 msec at Lag 63. Facilitation was reduced in the cross-modality condition for words and was absent from the cross-modality condition for nonwords. The modality-specific component of the repetition effect found in the word/nonword categorization paradigm may be attributed to persistence in the nonlexical, as distinct from lexical, component of the word categorization process.
Article
Full-text available
Geiselman and Bellezza (1976) concluded that any retention in memory of the sex of a speaker of verbal material is automatic. Two possible reasons for this were hypothesized: the voice-connotation hypothesis and the dual-hemisphere parallel-processing hypothesis. In Experiment 1, the to-be-remembered sentences contained either male or female agents. Incidental retention of sex of speaker did not occur. This result does not support the dual-hemisphere parallel-processing hypothesis, which indicates that retention of voice should be independent of sentence content. In Experiment 2, the sentences contained neutral agents, and incidental retention of sex of speaker did occur. The results of Experiments 1 and 2 support the connotation hypothesis. The different results with regard to incidental retention of the speaker's voice found in Experiments 1 and 2 were replicated in Experiment 3 using a within-subjects design. Experiment 4 was conducted to determine if a speaker's voice does, in fact, influence the meaning of a neutral sentence. In agreement with the voice-connotation hypothesis, sentences spoken by a male were rated as having more "potent" connotations than sentences spoken by a female.
Chapter
Variations in vocal tract size between speakers are reflected in the acoustic characteristics of their speech but are largely normalised out in perception. Can such normalisation be measured as an additional stage in speech perception?
Article
The development and evaluation of a new speech‐intelligibility test suitable for routine use by operational personnel in determining the performance level of speech‐communication systems is described. The format used is similar to that described for a rhyme test but makes use of a closed‐response set. An experiment was performed to determine the general reliability of the test materials when administered to U.S. Air Force enlisted personnel under a wide range of signal‐to‐noise ratios. Testing of 18 listeners over a period of 30 days showed that repeated exposure to the materials did not change the levels of average response in any appreciable way. Analysis of the responses to individual phonetic elements shows that the test can be useful for diagnostic study as well as for over‐all evaluation of communication systems. Talker differences that appeared during the experiment and the statistical reliability and sensitivity of the materials are analyzed and discussed.
Article
Two experiments demonstrated that Ss are capable of making within-modality memory discriminations in both visual and auditory modalities. In Experiment I, Ss studied mixed lists of pictures and labels representing common objects and were subsequently required to judge whether the original presentation was pictorial or verbal. The high level of performance achieved on this task was unaffected by degree of categorical relatedness of items within method of presentation or by instructions to produce visual images when items were presented verbally. In Experiment II, Ss demonstrated the ability to remember whether a sentence was originally presented by a male or a female speaker. Some strategies by which within-modality discrimination in memory might be accomplished are discussed.