Conference Paper

Audio, Visual and Audiovisual intelligibility of vowels produced in noise

... A second argument is that the gain in intelligibility from auditory-only to audiovisual perception of utterances is weaker for Lombard speech than for normal speech [8]. On the contrary, vowels produced in noise are on average more easily recognized in the visual-only and audiovisual modalities than vowels produced in silence [9]. This study aims to provide a third element of the answer by examining whether, in noise, speakers enhance their visible articulatory movements significantly more when their speech partner can see them than when the partner can only hear them. ...
Conference Paper
Full-text available
Speech produced in noise (Lombard speech) is characterized by increased vocal effort but also by amplified lip gestures. The current study examines whether this enhancement of visible speech cues may be sought by the speaker, even unconsciously, in order to improve his visual intelligibility. One subject played an interactive game in a quiet situation and then in 85 dB of cocktail-party noise, under three conditions of interaction: without interaction, in face-to-face interaction, and in audio-only interaction. The audio signal was recorded simultaneously with articulatory movements, using 3D electromagnetic articulography. The results showed that acoustic modifications of speech in noise were greater when the interlocutor could not see the speaker. Furthermore, tongue movements, which are hardly visible, were not particularly amplified in noise, and lip movements, which are highly visible, were not enhanced more in noise when the interlocutors could see each other; they were in fact enhanced more in the audio-only interaction condition. These results support the idea that this speaker did not make use of the visual channel to improve his intelligibility, and that his hyper-articulation was simply an indirect correlate of increased vocal effort.
Article
Full-text available
Over the last century, researchers have collected a considerable amount of data on the properties of Lombard speech, i.e., speech produced in a noisy environment. The documented phenomena predominantly concern effects on the speech signal produced in ambient noise. In comparison, relatively little is known about the underlying articulatory patterns of Lombard speech, in particular for lingual articulation. Here the authors present an analysis of articulatory recordings of speech material in babble noise of different intensity levels and in hypoarticulated speech, and report quantitative differences in the relative expansion of movement of different articulatory subsystems (the jaw, the lips, and the tongue) as well as in the relative expansion of utterance duration. The trajectory modifications for one articulator can be predicted relatively reliably from those for another, but the subsystems differ in the degree of continuity of trajectory expansion elicited across noise levels. Regression analysis of articulatory modifications against durational expansion shows further qualitative differences between the subsystems, namely the jaw and the tongue. The findings are discussed in terms of the possible influence of a combination of prosodic, segmental, and physiological factors. In addition, the Lombard effect is put forward as a viable methodology for eliciting global articulatory variation in a controlled manner.
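
By way of illustration only, a minimal sketch of how relative trajectory expansion might be quantified; the path-length measure, the EMA-like trajectories, and the 1.4 scaling are assumptions for the example, not the authors' method:

    import numpy as np

    def path_length(traj):
        # Total distance travelled along a (T, dims) trajectory.
        return np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))

    def relative_expansion(traj_noise, traj_quiet):
        # Ratio of movement path length in noise vs. quiet (> 1 = expansion).
        return path_length(traj_noise) / path_length(traj_quiet)

    # Hypothetical articulator traces (T samples x 2 dimensions).
    rng = np.random.default_rng(0)
    quiet = np.cumsum(rng.normal(size=(200, 2)), axis=0)
    noise = 1.4 * quiet  # simulated amplified gestures
    print(f"relative expansion: {relative_expansion(noise, quiet):.2f}")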
Article
Full-text available
In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible, and highly intelligible talking-face animation. We recorded audio, video, and facial motion-capture data of a talker uttering a set of 180 short sentences under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with a similar shape and appearance to the original talker and used an error-minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio, we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions in which normal speech audio was paired with animated Lombard speech or whispering. Compared to the congruent normal speech condition, the Lombard animation yielded a significant increase in intelligibility despite the AV incongruence. In a separate evaluation, we gathered subjective opinions on the different animations and found that some degree of incongruence was generally accepted.
Article
Full-text available
An earlier study compared audiovisual perception of speech produced in environmental noise (Lombard speech) and speech produced in quiet with the same environmental noise added. The results showed that listeners make differential use of the visual information depending on the recording condition, but gave no indication of how or why this might be so. A possible confound in that study was that high audio presentation levels might account for the small visual enhancements observed for Lombard speech. This paper reports results for a second perception study using much lower acoustic presentation levels, compares them with the results of the previous study, and integrates the perception results with analyses of the audiovisual production data: face and head motion, audio amplitude (RMS), and parameters of the spectral acoustics (line spectrum pairs). Index Terms: audiovisual speech, Lombard speech, production and perception links
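
As a point of reference, audio amplitude (RMS) is typically computed frame by frame; a minimal sketch, with frame and hop sizes chosen arbitrarily rather than taken from the paper:

    import numpy as np

    def frame_rms(signal, frame_len=512, hop=256):
        # Frame-wise RMS amplitude of a mono signal (sizes are arbitrary here).
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len + 1, hop)]
        return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])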
Conference Paper
Full-text available
In this study we explore how the acoustic and lip articulatory characteristics of bilabial consonants and three extreme French vowels vary in Lombard speech. In the light of several theories of segment perception, we show that formant modifications should decrease the audio intelligibility of vowels in noise. On the contrary, modifications in lip articulation should improve the visual intelligibility of vowels and bilabial consonants. This does not agree with previous studies, which reported a globally increased intelligibility of Lombard speech, especially in the audio domain and much less in the visual one. Thus, more detailed research is needed on the segmental and prosodic contributions to the increased intelligibility of Lombard speech.
Article
Full-text available
To examine the influence of sound immersion techniques and speech production tasks on speech adaptation in noise. In Experiment 1, we compared the modification of speakers' perception and speech production in noise when noise is played through headphones (with and without additional self-monitoring feedback) or over loudspeakers. We also examined how this sound immersion effect depends on noise type (broadband or cocktail party) and level (from 62 to 86 dB SPL). In Experiment 2, we compared the modification of acoustic and lip articulatory parameters in noise when speakers did or did not interact with a speech partner. Speech modifications in noise were greater when cocktail-party noise was played through headphones than over loudspeakers. This effect was less noticeable in broadband noise. Adding self-monitoring feedback into the headphones reduced the effect but did not completely compensate for it. Speech modifications in noise were greater in the interactive situation and concerned parameters that may not be related to voice intensity. The results support the idea that the Lombard effect is both a communicative adaptation and an automatic regulation of vocal intensity. The influence of auditory and communicative factors has methodological implications for the choice of appropriate paradigms to study the Lombard effect.
Article
Full-text available
Bimodal perception leads to better speech understanding than auditory perception alone. We evaluated the overall benefit of lip-reading on natural utterances of French produced by a single speaker. Eighteen French subjects with good hearing and vision were administered a closed-set identification test of VCVCV nonsense words consisting of three vowels [i, a, y] and six consonants [b, v, z, ʒ, ʁ, l]. Stimuli were presented under both auditory and audio-visual conditions with white noise added at various signal-to-noise ratios. Identification scores were higher in the bimodal condition than in the auditory-alone condition, especially in situations where acoustic information was reduced. The auditory and audio-visual intelligibility of the three vowels [i, a, y] averaged over the six consonantal contexts was evaluated as well. Two different hierarchies of intelligibility were found: auditorily, [a] was most intelligible, followed by [i] and then [y], whereas visually [y] was most intelligible, followed by [a] and [i]. We also quantified the contextual effects of the three vowels on the auditory and audio-visual intelligibility of the consonants. Both the auditory and the audio-visual intelligibility of surrounding consonants were highest in the [a] context, followed by the [i] context and lastly the [y] context.
Article
Full-text available
A two-part study examined recognition of speech produced in quiet and in noise by normal-hearing adults. In Part I, five women produced 50 sentences consisting of an ambiguous carrier phrase followed by a unique target word. These sentences were spoken in three environments: quiet, wide-band noise (WBN), and meaningful multi-talker babble (MMB). The WBN and MMB competitors were presented through insert earphones at 80 dB SPL. For each talker, the mean vocal level, long-term average speech spectrum, and mean word duration were calculated for the 50 target words produced in each speaking environment. Compared to quiet, the vocal levels produced in WBN and MMB increased by an average of 14.5 dB. The increase in vocal level was characterized by increased spectral energy in the high frequencies. Word duration also increased by an average of 77 ms in WBN and MMB relative to the quiet condition. In Part II, the sentences produced by one of the five talkers were presented to 30 adults in the presence of multi-talker babble under two conditions, and recognition was evaluated for each. In the first condition, the sentences produced in quiet and in noise were presented at equal signal-to-noise ratios (SNR(E)), which removed the vocal level differences between the speech samples. In the second condition, the vocal level differences were preserved (SNR(P)). For the SNR(E) condition, recognition of the speech produced in WBN and MMB was on average 15% higher than for the speech produced in quiet. For the SNR(P) condition, recognition increased by an average of 69% for these same speech samples relative to speech produced in quiet. In general, correlational analyses failed to show a direct relation between the acoustic properties measured in Part I and the recognition measures in Part II.
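
The SNR(E) manipulation amounts to rescaling each speech sample against the babble before mixing; a minimal sketch of that rescaling, assuming NumPy arrays for speech and noise (the function and its parameters are illustrative, not the study's code):

    import numpy as np

    def scale_to_snr(speech, noise, target_snr_db):
        # Rescale speech so its RMS sits target_snr_db above the noise RMS,
        # removing vocal-level differences between recordings (the SNR(E) idea).
        rms = lambda x: np.sqrt(np.mean(x ** 2))
        gain = 10 ** (target_snr_db / 20) * rms(noise) / rms(speech)
        return gain * speech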
Article
This study investigates the hypothesis that speakers make active use of the visual modality in production to improve their speech intelligibility in noisy conditions. Six native speakers of Canadian French produced speech in quiet conditions and in 85 dB of babble noise, in three situations: interacting face-to-face with the experimenter (AV), using the auditory modality only (AO), or reading aloud (NI, no interaction). The audio signal was recorded with the three-dimensional movements of their lips and tongue, using electromagnetic articulography. All the speakers reacted similarly to the presence vs absence of communicative interaction, showing significant speech modifications with noise exposure in both interactive and non-interactive conditions, not only for parameters directly related to voice intensity or for lip movements (very visible) but also for tongue movements (less visible); greater adaptation was observed in interactive conditions, though. However, speakers reacted differently to the availability or unavailability of visual information: only four speakers enhanced their visible articulatory movements more in the AV condition. These results support the idea that the Lombard effect is at least partly a listener-oriented adaptation. However, to clarify their speech in noisy conditions, only some speakers appear to make active use of the visual modality.
Article
Perception results for three studies are presented that address the role of Lombard speech in auditory, visual, and auditory‐visual speech perception. Predictably, when presented in auditory‐only conditions with masking noise, listeners recover speech recorded in noise (Lombard speech) better than speech recorded in quiet and presented with the same level of masking noise. However, there is almost no difference in listener performance when Lombard and quiet speech are presented audiovisually with masking noise. Both conditions are enhanced compared to auditory‐alone conditions, but there is no indication that the facial motion correlates, demonstrated previously for quiet speech [H.C. Yehia, et al., Speech Commun. 26, 23–44 (1998)], play as strong a role in enhancing auditory‐visual processing of Lombard speech, even though Lombard speech is accompanied by larger facial motions. Perhaps it is no surprise that, at a cocktail party, one leans in with an ear rather than with the eyes. [Research supported by CFI and NSERC.]
Article
Loudness manipulation is an important clinical tool for reducing functional communication limitations and increasing speaking participation in individuals with dysarthria. Increasing loudness appears to influence all domains of the speech production system (i.e., respiratory, laryngeal, and orofacial), resulting in increased effort and coordination. The purpose of the present study was to investigate the influence that speaking under different levels of background noise has on lip contact pressure, a measure of physiological effort, during bilabial consonant production. Ten young adults ranging in age from 20 to 24 years read 30 sentences under three different levels of multitalker babble background noise (0, 40, and 80 dB HL). An Entran pressure transducer was used to acquire lip contact pressure for words containing the bilabial consonants /p/, /b/, and /m/. Voice intensity significantly increased across noise conditions; however, static and dynamic measures of articulatory contact pressure (ACP) were not significantly influenced by background noise. The increase in physiological effort represented by ACP thus appears not to be related to increasing intensity while speaking in background noise. Further study of the factors influencing different aspects of articulatory contact pressure is recommended for individuals with and without motor speech disorders.
Article
The aim of this study was to investigate the associations between noise (ambient and activity noise) and objective metrics of teachers' voices in real working environments, i.e., classrooms. Thirty-two female and eight male teachers from 14 elementary schools were randomly selected for the study. Ambient noise was measured during breaks in unoccupied classrooms, as was the noise caused by pupils' activity during lessons. Voice samples were recorded before and after a working day. The voice variables measured were sound pressure level (voice SPL), fundamental frequency (F0), jitter, shimmer, and the tilt of the sound spectrum slope (alpha ratio). Ambient noise correlated most often with male F0 and with voice SPL, while activity noise correlated with the alpha ratio and perturbation values. Teachers working under louder ambient noise spoke more loudly before work than those working at lower noise levels. Voice variables generally changed less during work among teachers working in loud activity noise than among those working at lower noise levels. Ambient and activity noise thus affect teachers' voice use: under loud ambient noise teachers seem to speak habitually loudly, and under loud activity noise their ability to react to vocal loading deteriorates.
Article
Talkers modify their speech production in noisy environments partly as a reflex but also as an intentional communicative strategy to facilitate the transmission of the speech signal to the interlocutor. Previous studies have shown that talkers can adapt both the auditory and visual elements of speech produced in noise. The current study examined whether auditory and visual speech production would be affected by whether talkers could see their interlocutor. Participants completed an interactive communication game in quiet and in various noise conditions, with and without being able to see their interlocutor. The results showed that the amplitude of talkers' speech modifications was significantly lower when interlocutors could see each other. Furthermore, talkers instead increased the saliency of their visual speech production (measured as lip area) in noisy conditions for face-to-face communication. These results suggest that talkers actively monitor their environment and adapt their speech production for efficient communication.
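
Lip area from tracked lip-contour landmarks can be computed with the shoelace formula; a minimal sketch under the assumption of an ordered (N, 2) array of contour points (the tracking itself is outside the scope of this example):

    import numpy as np

    def lip_area(contour):
        # Area enclosed by an ordered (N, 2) lip contour (shoelace formula).
        x, y = contour[:, 0], contour[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))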
Article
"Oral speech intelligibility tests were conducted with, and without, supplementary visual observation of the speaker's facial and lip movements. The difference between these two conditions was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test. The visual contribution to oral speech intelligibility (relative to its possible contribution) is, to a first approximation, independent of the speech-to-noise ratio under test. However, since there is a much greater opportunity for the visual contribution at low speech-to-noise ratios, its absolute contribution can be exploited most profitably under these conditions." (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
A noisy environment usually degrades the intelligibility of a human speaker or the performance of a speech recognizer. In such environments, speakers make articulatory changes in order to remain intelligible, a phenomenon known as the Lombard effect. Over the last few years, special emphasis has been placed on analyzing and dealing with the Lombard effect within the framework of automatic speech recognition. Thus, the first purpose of the work presented in this paper was to study the possible common tendencies of some acoustic features across different phonetic units in Lombard speech. Another goal was to study the influence of gender on the characterization of these tendencies. Extensive statistical tests were carried out for each feature and each phonetic unit, using a large Spanish continuous speech corpus. The results reported here confirm the changes produced in Lombard speech with respect to normal speech. Nevertheless, some new tendencies have been observed from the outcome of the statistical tests.
Article
Previous studies have documented phenomena involving the modification of human speech in special communication circumstances. Whether speaking to a hearing-impaired person (clear speech) or in a noisy environment (Lombard speech), speakers tend to make similar modifications to their normal, conversational speaking style in order to increase the understanding of their message by the listener. One strategy characteristic of the above speech types is to increase consonant power relative to the signal power of adjacent vowels and is referred to as consonant–vowel (CV) ratio boosting. An automated method of speech enhancement using CV ratio boosting is called energy redistribution voiced/unvoiced (ERVU). To characterize the performance of ERVU, 25 listeners responded to 500 words in a two-word, forced-choice experiment in the presence of energetic masking noise. The test material was a vocabulary of confusable monosyllabic words spoken by 8 male and 8 female speakers, and the conditions tested were a control (unmodified speech), ERVU, and a high-pass filter (HPF). Both ERVU and the HPF significantly increased recognition accuracy compared to the control. Nine of the 16 speakers were significantly more intelligible when ERVU or the HPF was used, compared to the control, while no speaker was less intelligible. The results show that ERVU successfully increased intelligibility of speech using a simple automated segmentation algorithm, applicable to a wide variety of communication systems such as cell phones and public address systems.
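
A crude sketch of CV-ratio boosting, assuming consonant intervals are already segmented; this illustrates the general idea only, not the ERVU algorithm itself (its voiced/unvoiced segmentation and gain rules are not reproduced here):

    import numpy as np

    def boost_consonants(signal, consonant_spans, gain_db=6.0):
        # Amplify consonant intervals, then renormalize so overall power stays
        # constant (energy is redistributed rather than added). consonant_spans
        # holds (start, end) sample indices; gain_db is an arbitrary choice.
        rms = lambda x: np.sqrt(np.mean(x ** 2))
        x = np.asarray(signal, dtype=float)
        out = x.copy()
        for start, end in consonant_spans:
            out[start:end] *= 10 ** (gain_db / 20)
        return out * (rms(x) / rms(out))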
Article
Talkers modify the way they speak in the presence of noise. As well as increases in voice level and fundamental frequency (F0), a flattening of spectral tilt is observed. The resulting "Lombard speech" is typically more intelligible than speech produced in quiet, even when level differences are removed. What causes the enhanced intelligibility of Lombard speech? The current study explored the relative contributions of changes in mean F0 and spectral tilt to intelligibility. The roles of F0 and spectral tilt were assessed by measuring the intelligibility gain of non-Lombard speech whose mean F0 and spectrum were manipulated, both independently and in concert, to simulate those of natural Lombard speech. In the presence of speech-shaped noise, flattening of spectral tilt contributed greatly to the intelligibility gain of noise-induced speech over speech produced in quiet, while an increase in F0 did not have a significant influence. The perceptual effect of spectrum flattening was attributed to its ability to increase the proportion of the speech time-frequency plane "glimpsed" in the presence of noise. However, spectral tilt changes alone could not fully account for the intelligibility of Lombard speech; other changes observed in Lombard speech, such as durational modifications, may well contribute.
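
Spectral tilt flattening is often approximated with first-order pre-emphasis; a minimal sketch, with the coefficient chosen arbitrarily (the study's actual spectral manipulation was more controlled):

    import numpy as np

    def flatten_tilt(signal, alpha=0.95):
        # First-order pre-emphasis y[n] = x[n] - alpha * x[n-1]: boosts high
        # frequencies, approximating the flatter tilt of Lombard speech.
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])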
Article
Seeing the talker improves the intelligibility of speech degraded by noise (a visual speech benefit). Given that talkers exaggerate spoken articulation in noise, this set of two experiments examined whether the visual speech benefit was greater for speech produced in noise than in quiet. We first examined the extent to which spoken articulation was exaggerated in noise by measuring the motion of face markers as four people uttered 10 sentences either in quiet or in babble-speech noise (these renditions were also filmed). The tracking results showed that articulated motion in speech produced in noise was greater than that produced in quiet and was more highly correlated with speech acoustics. Speech intelligibility was tested in a second experiment using a speech-perception-in-noise task under auditory-visual and auditory-only conditions. The results showed that the visual speech benefit was greater for speech recorded in noise than for speech recorded in quiet. Furthermore, the amount of articulatory movement was related to performance on the perception task, indicating that the enhanced gestures made when speaking in noise function to make speech more intelligible.
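
The reported link between articulated motion and acoustics is, at heart, a simple correlation; a sketch, assuming time-aligned frame-wise marker speed and audio RMS (both hypothetical inputs):

    import numpy as np

    def motion_acoustics_corr(marker_speed, audio_rms):
        # Pearson correlation between face-marker speed and audio RMS per frame.
        return np.corrcoef(marker_speed, audio_rms)[0, 1]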
Article
Automatic speech recognition experiments show that, depending on the task performed and how speech variability is modeled, automatic speech recognizers are more or less sensitive to the Lombard reflex. To gain an understanding about the Lombard effect with the prospect of improving performance of automatic speech recognizers, (1) an analysis was made of the acoustic-phonetic changes occurring in Lombard speech, and (2) the influence of the Lombard effect on speech perception was studied. Both acoustic and perceptual analyses suggest that the influence of the Lombard effect on male and female speakers is different. The analyses also bring to light that, even if some tendencies across speakers can be observed consistently, the Lombard reflex is highly variable from speaker to speaker. Based on the results of the acoustic and perceptual studies, some ways of dealing with Lombard speech variability in automatic speech recognition are also discussed.
Lombard speech: Auditory (a), visual (v) and av effects
C. Davis, J. Kim, K. Grauwinkel, and H. Mixdorff, "Lombard speech: Auditory (a), visual (v) and av effects," in Proceedings of the Third International Conference on Speech Prosody, 2006, pp. 248-252.