Figure 6: Comparison of Word Error Rates on a Context-Dependent and a Context-Independent System

Source publication
Conference Paper
Full-text available
We present our recent advances in silent speech interfaces using electromyographic signals that capture the movements of the human articulatory muscles at the skin surface for recognizing continuously spoken speech. Previous systems were limited to speaker- and session-dependent recognition tasks on small amounts of training and test data. In thi...

Context in source publication

Context 1
... The final context-dependent EMG recognizer is trained using the 600 acoustic models defined in the previous step. Figure 6 shows the recognition results of the context-dependent recognizer. The context-independent system has an overall average WER of 68.92%, which drops to 60.97% with context-dependent modeling. ...
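As a reminder of how the reported numbers are computed, word error rate is the word-level edit distance between hypothesis and reference transcripts divided by the reference length. The following is a minimal sketch; the example sentences are hypothetical and not taken from the corpus.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example, not from the EMG corpus:
print(word_error_rate("the context dependent system wins", "the context independent system"))
```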

Similar publications

Article
Full-text available
Natural, fast, stable and reliable interaction between human and machine is the ideal mode of interaction pursued by human beings. Speech recognition is a process of pattern-matching recognition. Effective speech detection technology can not only reduce the processing time of the system but also improve the real-time performance and accuracy of the system processing,...
Article
Full-text available
This paper proposes an efficient method of simulated-data adaptation for robust speech recognition. The method is applied to tree-structured piecewise linear transformation (PLT). The original PLT selects an acoustic model using tree-structured HMMs and the acoustic model is adapted by input speech in an unsupervised scheme. This adaptation can deg...
Conference Paper
Full-text available
This paper proposes an efficient acoustic model adaptation method based on the use of simulated-data in maximum likelihood linear regression (MLLR) adaptation for robust speech recognition. Online MLLR adaptation is an unsupervised process which requires an input speech with phone labels transcribed automatically. Instead of using only the input si...
Conference Paper
Full-text available
Speech recognition is a natural means of interaction for a human with a smart assistive environment. In order for this interaction to be effective, such a system should attain a high recognition rate even under adverse conditions. Audio-visual speech recognition (AVSR) can be of help in such environments, especially under the presence of audio nois...

Citations

... Several studies on continuous speech recognition were done using sEMG signals measured from five neck and facial muscles. The studies aimed to compare speaker-dependent, speaker-independent and speaker-adaptive models (Wand and Schultz 2009), to examine session-independent and session-adaptive systems (Wand and Schultz 2011), and to investigate the use of a supervised model vs. an unsupervised one in a session-independent recognizer (Wand and Schultz 2014). ...
... Data fusion for merging between several subjects is challenging and requires the use of special multi-subject models. One approach for multi-subject EMG-based speech classification, introduced by Wand and Schultz (2009), is a context-independent or context-dependent phoneme-based recognizer. In their cross-speaker training, the system was trained on all measurements from all participating speakers except the test speaker. ...
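The cross-speaker training protocol described above is a leave-one-speaker-out scheme. A minimal sketch follows, assuming a generic feature matrix, word labels and speaker IDs as placeholders; the classifier choice is illustrative, not the authors' recognizer.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 32))            # EMG feature vectors (placeholder)
y = rng.integers(0, 7, size=120)          # word labels (placeholder)
speakers = np.repeat(np.arange(4), 30)    # 4 speakers, 30 utterances each (placeholder)

# Train on all speakers except the held-out test speaker, then evaluate on that speaker.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=speakers):
    clf = SVC().fit(X[train_idx], y[train_idx])
    acc = clf.score(X[test_idx], y[test_idx])
    print(f"held-out speaker {speakers[test_idx][0]}: accuracy {acc:.2f}")
```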
Article
Full-text available
Automatic speech recognition is the main form of man–machine communication. Recently, several studies have shown the ability to automatically recognize speech based on electromyography (EMG) signals of the facial muscles using machine learning methods. The objective of this study was to utilize machine learning methods for automatic identification of speech based on EMG signals. EMG signals from three facial muscles were measured from four healthy female subjects while pronouncing seven different words 50 times. Short-time Fourier transform features were extracted from the EMG data. Principal component analysis (PCA) and locally linear embedding (LLE) methods were applied and compared for reducing the dimensions of the EMG data. K-nearest-neighbors was used to examine the ability to identify different word sets of a subject based on his own dataset, and to identify words of one subject based on another subject's dataset, utilizing an affine transformation for aligning the reduced feature spaces of two subjects. PCA and LLE achieved an average recognition rate of 81% for five-word sets in the single-subject approach. The best average recognition success rates for three- and five-word sets were 88.8% and 74.6%, respectively, for the multi-subject classification approach. Both PCA and LLE achieved satisfactory classification rates for both the single-subject and multi-subject approaches. The multi-subject classification approach enables robust classification of words recorded from a new subject based on another subject's dataset and thus can be applicable for people who have lost their ability to speak.
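A minimal sketch of the pipeline this abstract describes, dimensionality reduction with PCA or LLE followed by k-nearest-neighbors classification, using synthetic placeholder data rather than the study's recordings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(350, 64))        # per-utterance STFT feature vectors (placeholder)
y = rng.integers(0, 7, size=350)      # 7 words, as in the study

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Compare PCA and LLE as dimensionality-reduction front ends for a KNN classifier.
for name, reducer in [("PCA", PCA(n_components=10)),
                      ("LLE", LocallyLinearEmbedding(n_components=10))]:
    Z_tr = reducer.fit_transform(X_tr)
    Z_te = reducer.transform(X_te)
    knn = KNeighborsClassifier(n_neighbors=5).fit(Z_tr, y_tr)
    print(name, "accuracy:", knn.score(Z_te, y_te))
```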
... Some methods for silent speech rely on non-vocalized articulatory movements and are available to neurologically normal patients or those with damaged articulatory muscles. This typically involves the acquisition of signals from orofacial and laryngeal speech muscles using surface electromyography (EMG) electrodes [1,2,3]. The acquired signals are then transmitted to an analysis computer which is programmed to convert these EMG patterns to vocabulary. ...
Conference Paper
The ability of EEG signals to portray thoughts, feelings and unspoken words is being widely explored these days and has become a potential area of research. This is based upon the fact that an electrical impulse is generated when a specific word is thought of in the brain, even before it reaches the vocal cords. These unspoken speech impulses can be analyzed and translated into distinct words, which may enable a locked-in patient to communicate with the world. Based on an extensive survey of brain mapping techniques, an EEG database shall be studied to design a proposal to achieve the target. Neurological advancements have led to the development of the Emotiv unit, a personal interface for human–computer interaction, which shall be used to acquire and store the EEG database. An algorithm shall then be developed for artefact removal and brain mapping so that word-specific neural signals can be characterized before being vocalized. Mathematical tools like Fuzzy Logic or Artificial Neural Networks shall be explored for mapping the brain. Instrumentation shall be developed to convert the electrical impulses of these unspoken words generated at the brain into digital form so that they can be converted into synthesized speech. The design has the potential to use a brain–computer interface to assist locked-in patients in translating their thoughts into speech in real-time applications.
... The feature extraction method for continuous speech was proposed at that time. Recently, Tanja Schultz [7] proposed a new approach to improve the performance of this system, namely speaker-adaptive training (MLLR adaptation). A large corpus of 13 speakers and a 101-word vocabulary was used for this research. ...
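MLLR adaptation, as referred to above, shifts the Gaussian means of a speaker-independent model with a shared affine transform estimated from a small amount of adaptation data. The sketch below is a greatly simplified illustration: it fits the transform by least squares from frames already assigned to their Gaussians, whereas the real algorithm uses a maximum-likelihood, EM-style closed-form solution; all data here are placeholders.

```python
import numpy as np

def estimate_global_mllr(means, adapt_frames, assignments):
    """Least-squares fit of W = [A b] mapping extended means [mu, 1] to adaptation data."""
    targets = np.array([adapt_frames[assignments == k].mean(axis=0)
                        for k in range(len(means))])
    ext = np.hstack([means, np.ones((len(means), 1))])   # extended means [mu, 1]
    W, *_ = np.linalg.lstsq(ext, targets, rcond=None)     # solves ext @ W ~ targets
    return W.T                                            # shape (d, d+1)

def adapt_means(means, W):
    """Apply mu' = A mu + b to every Gaussian mean."""
    ext = np.hstack([means, np.ones((len(means), 1))])
    return ext @ W.T

# Placeholder data: 5 Gaussian means in a 3-dimensional feature space,
# adaptation frames shifted by a constant offset plus noise.
rng = np.random.default_rng(2)
means = rng.normal(size=(5, 3))
assignments = np.repeat(np.arange(5), 20)
adapt_frames = means[assignments] + 0.5 + 0.1 * rng.normal(size=(100, 3))

W = estimate_global_mllr(means, adapt_frames, assignments)
print(adapt_means(means, W) - means)   # shift applied by the adaptation transform
```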
Article
Full-text available
This paper aims to investigate the features of surface electromyography (sEMG) which can classify the Thai tonal sound for an EMG speech recognition and synthesis system. Signals were captured at seven positions on the strap muscles as a subject was uttering nine monosyllabic words, each of which includes five tones. Eight features, i.e. Root Mean Square (RMS), Variance (VAR), Waveform Length (WL), Willson Amplitude (WAMP), Median Frequency (MDF) and three types of Spectral Moment (SM), were computed and plotted on scatter graphs to cluster the tones. The results indicate that the EMG signal of the strap muscles can clearly classify the tones into three groups, i.e. a rising tone, a high tone, and the remainder clustered as one group. Moreover, RMS, VAR and WL can classify the high tone better than the other features. All of the Spectral Moments yield similar classification results; in particular, they classify the rising tone well. For the remaining tones, when the scatter graphs are considered without the rising tone and the high tone, the low tone can be separated from the group only when WL or WAMP is used for classification.
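For reference, a sketch of the time- and frequency-domain sEMG features named in this abstract, computed on a synthetic signal; the sampling rate, window length and WAMP threshold are assumptions, not the paper's settings.

```python
import numpy as np

def emg_features(x, fs=1000.0, wamp_threshold=0.02):
    rms = np.sqrt(np.mean(x ** 2))                      # Root Mean Square
    var = np.var(x)                                     # Variance
    wl = np.sum(np.abs(np.diff(x)))                     # Waveform Length
    wamp = np.sum(np.abs(np.diff(x)) > wamp_threshold)  # Willson Amplitude
    # Power spectrum for the frequency-domain features.
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    cumulative = np.cumsum(psd)
    mdf = freqs[np.searchsorted(cumulative, cumulative[-1] / 2)]   # Median Frequency
    moments = [np.sum(psd * freqs ** n) for n in (0, 1, 2)]        # spectral moments
    return dict(RMS=rms, VAR=var, WL=wl, WAMP=wamp, MDF=mdf, SM=moments)

rng = np.random.default_rng(3)
print(emg_features(rng.normal(scale=0.05, size=1000)))   # one analysis window (placeholder)
```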
... CERT members in the four pilot districts were subjected to two exercises emulating a speaker-dependent and a speaker-independent process. The reader may refer to [11] and [12] to gain knowledge of speaker-dependent and speaker-independent computer science paradigms. Recordings from the first and second exercises, which consisted of simple, meaningful, short sentences, produced the required speech samples. ...
Article
Full-text available
Freedom Fone (FF) is an Interactive Voice Response (IVR) system that integrates with the Global System for Mobile (GSM) telecommunications [1]. Sahana is a disaster information management expert system working with Internet technologies [2]. The project's intent was to mediate information between FF and Sahana through the Emergency Data Exchange Language (EDXL) interoperable content standard [3]. Its goal was to equip Sarvodaya, Sri Lanka's largest humanitarian organization, with voice-enabled disaster communication. The 3.52 Mean Opinion Score (MOS) for voice quality was an early automation challenge in introducing Automatic Speech Recognition (ASR). A 4.0 MOS was determined as a cut-point for classifying reliable voice data [4]. The Percent Difficult (PD) in an emulated speaker-independent scenario was 29.44% and in a speaker-dependent scenario was 13.24%. Replacing human operators with ASR software proved inefficient [5] and [6]. This paper discusses uncertainties that are barriers to integrating voice-enabled automated emergency communication services for response resource analysis and decision support.
... In 2010, Schultz and Wand (2010) reported similar average accuracies using phonetic feature bundling for modelling coarticulation on the same vocabulary and an accuracy of 90% for the best-recognized speaker. In the last years several issues of EMG-based recognition have been addressed, such as investigating new modeling schemes towards continuous speech (Jou et al., 2007; Schultz and Wand, 2010), speaker adaptation (Maier-Hein et al., 2005; Wand and Schultz, 2009) and the usability of the capturing devices (Manabe et al., 2003; Manabe and Zhang, 2004). Latest research in this area has been focused on the differences between audible and silent speech and how to decrease the impact of different speaking modes (Wand and Schultz, 2011a); the importance of acoustic feedback (Herff et al., 2011); EMG-based phone classification (Wand and Schultz, 2011b); and session-independent training methods (Wand and Schultz, 2011c). ...
Conference Paper
Full-text available
A Silent Speech Interface (SSI) aims at performing Automatic Speech Recognition (ASR) in the absence of an intelligible acoustic signal. It can be used as a human-computer interaction modality in high-background-noise environments, such as living rooms, or in aiding speech-impaired individuals, whose prevalence increases with ageing. If this interaction modality is made available for users' own native language, with adequate performance, and since it does not rely on acoustic information, it will be less susceptible to problems related to environmental noise, privacy, information disclosure and exclusion of speech-impaired persons. To contribute to the existence of this promising modality for Portuguese, for which no SSI implementation is known, we are exploring and evaluating the potential of state-of-the-art approaches. One of the major challenges we face in SSI for European Portuguese is recognition of nasality, a core characteristic of this language's Phonetics and Phonology. In this paper a silent speech recognition experiment based on Surface Electromyography is presented. Results confirmed recognition problems between minimal pairs of words that differ only in the nasality of one of the phones, causing 50% of the total error and evidencing accuracy performance degradation, which correlates well with the existing knowledge.
... There exist some studies on speaker adaptation for EMG-based speech recognition tasks (Maier-Hein et al., 2005;Wand and Schultz, 2009). Generally speaking, these experiments show that when data of different speakers is combined, the recognition performance degrades severely. ...
Conference Paper
Full-text available
This paper reports on our recent research in speech recognition by surface electromyography (EMG), which is the technology of recording the electric activation potentials of the human articulatory muscles by surface electrodes in order to recognize speech. This method can be used to create Silent Speech Interfaces, since the EMG signal is available even when no audible signal is transmitted or captured. Several past studies have shown that EMG signals may vary greatly between different recording sessions, even of one and the same speaker. This paper shows that session-independent training methods may be used to obtain robust EMG-based speech recognizers which cope well with unseen recording sessions as well as with speaking mode variations. Our best session-independent recognition system, trained on 280 utterances of 7 different sessions, achieves an average 21.93% Word Error Rate (WER) on a testing vocabulary of 108 words. The overall best session-adaptive recognition system, based on a session-independent system and adapted towards the test session with 40 adaptation sentences, achieves an average WER of 15.66%, which is a relative improvement of 21% compared to the baseline average WER of 19.96% of a session-dependent recognition system trained only on a single session of 40 sentences.
... In (Wand and Schultz, 2009b) we reported first EMG recognition results based on 26 recording sessions with 13 speakers of the audible part of the EMG-PIT pilot study subset. For each speaker, the audible part of the SPEC set was used for training, and the BASE set for testing. ...
... Note that (Wand and Schultz, 2009b) only used a stacking width of 5 frames. On the EMG-PIT corpus, a stacking width of 15 frames gives significantly better results (Wand and Schultz, 2009a). ...
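The stacking width mentioned above refers to concatenating each feature frame with its neighbours into one context vector, so a width of 15 corresponds to 7 frames on each side. A minimal sketch, with the feature matrix as a placeholder:

```python
import numpy as np

def stack_frames(frames, k):
    """frames: (T, d) array; returns (T, (2k+1)*d) array of stacked context windows."""
    T, d = frames.shape
    padded = np.pad(frames, ((k, k), (0, 0)), mode="edge")   # repeat edge frames at the borders
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])

feats = np.random.default_rng(4).normal(size=(100, 32))      # placeholder feature frames
print(stack_frames(feats, k=7).shape)                        # (100, 480): stacking width of 15 frames
```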
... In (Wand and Schultz, 2009b) we considered speaker-dependent and speakerindependent phoneme-based EMG recognizers. This means that we regard each frame of the EMG signal as the representation of the beginning, middle, or end state of a phoneme. ...
Article
This paper discusses the use of surface electromyography for automatic speech recognition. Electromyographic signals captured at the facial muscles record the activity of the human articulatory apparatus and thus allow a speech signal to be traced back even if it is spoken silently. Since speech is captured before it gets airborne, the resulting signal is not masked by ambient noise. The resulting Silent Speech Interface has the potential to overcome major limitations of conventional speech-driven interfaces: it is not prone to any environmental noise, allows confidential information to be transmitted silently, and does not disturb bystanders. We describe our new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition and report results on the EMG-PIT corpus, a multiple-speaker large-vocabulary database of silent and audible EMG speech recordings which we recently collected. Our results on speaker-dependent and speaker-independent setups show that modeling the interdependence of phonetic features reduces the word error rate of the baseline system by over 33% relative. Our final system achieves 10% word error rate for the best-recognized speaker on a 101-word vocabulary task, bringing EMG-based speech recognition within a useful range for the application of Silent Speech Interfaces.
... When considering a method that provides artificial communication to a speech-deprived individual, one must identify the most efficient means given the nature of the individual's impairment. Some methods rely on non-vocalized articulator movements or other sub-vocalizations (Betts and Jorgensen, 2006; Fagan et al., 2008; Jorgensen et al., 2003; Jou et al., 2006; Jou and Schultz, 2009; Maier-Hein et al., 2005; Mendes et al., 2008; Walliczek et al., 2006; Wand and Schultz, 2009) which can be helpful for speech-deprived individuals (e.g. laryngectomy patients). ...
... Current methods for silent speech available to neurologically normal individuals or those with damaged vocal tracts typically involve the placement of surface electromyographic (EMG) electrodes on the orofacial and laryngeal speech articulators (Betts and Jorgensen, 2006; Fagan et al., 2008; Jorgensen et al., 2003; Jou et al., 2006; Jou and Schultz, 2009; Maier-Hein et al., 2005; Mendes et al., 2008; Wand and Schultz, 2009). Electrical recordings are transmitted from the electrodes to an analysis computer which has been trained to recognize a small vocabulary of words based upon the speaker's EMG pattern. ...
Article
Full-text available
This paper briefly reviews current silent speech methodologies for normal and disabled individuals. Current techniques utilizing electromyographic (EMG) recordings of vocal tract movements are useful for physically healthy individuals but fail for tetraplegic individuals who do not have accurate voluntary control over the speech articulators. Alternative methods utilizing EMG from other body parts (e.g., hand, arm, or facial muscles) or electroencephalography (EEG) can provide capable silent communication to severely paralyzed users, though current interfaces are extremely slow relative to normal conversation rates and require constant attention to a computer screen that provides visual feedback and/or cueing. We present a novel approach to the problem of silent speech via an intracortical microelectrode brain computer interface (BCI) to predict intended speech information directly from the activity of neurons involved in speech production. The predicted speech is synthesized and acoustically fed back to the user with a delay under 50 ms. We demonstrate that the Neurotrophic Electrode used in the BCI is capable of providing useful neural recordings for over 4 years, a necessary property for BCIs that need to remain viable over the lifespan of the user. Other design considerations include neural decoding techniques based on previous research involving BCIs for computer cursor or robotic arm control via prediction of intended movement kinematics from motor cortical signals in monkeys and humans. Initial results from a study of continuous speech production with instantaneous acoustic feedback show the BCI user was able to improve his control over an artificial speech synthesizer both within and across recording sessions. The success of this initial trial validates the potential of the intracortical microelectrode-based approach for providing a speech prosthesis that can allow much more rapid communication rates.
... (Busso, Deng et al. 2004) proposed a system based on bimodal data, video and acoustic, to recognize the expression of the user. More recently, facial surface EMG sensors have also been successfully used to build continuous speech recognition systems (Wand and Schultz 2009) and speech synthesis systems from voiceless EMG signals (Toth, Wand et al. 2009). ...
Conference Paper
Full-text available
In this paper we describe a way to enhance human-computer interaction using facial electromyographic (EMG) sensors. Indeed, knowing the emotional state of the user enables adaptable interaction specific to the mood of the user. This way, Human Computer Interaction (HCI) will gain in ergonomics and ecological validity. While expression recognition systems based on video need exaggerated facial expressions to reach high recognition rates, the technique we developed using electrophysiological data enables faster detection of facial expressions, even in the presence of subtle movements. Features from 8 EMG sensors located around the face were extracted. Gaussian models for six basic facial expressions - anger, surprise, disgust, happiness, sadness and neutral - were learnt from these features and provide a mean recognition rate of 92%. Finally, a prototype of one possible application of this system was developed wherein the output of the recognizer was sent to the expressions module of a 3D avatar that then mimicked the expression.
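A minimal sketch of the per-expression Gaussian classifier this abstract describes: one multivariate Gaussian is fitted per expression class and a new feature vector is assigned to the class with the highest likelihood. The feature data here is synthetic, standing in for the 8-sensor facial EMG features.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Fit one multivariate Gaussian per class label."""
    return {c: multivariate_normal(X[y == c].mean(axis=0),
                                   np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)}

def classify(models, x):
    """Assign x to the class whose Gaussian gives the highest log-likelihood."""
    return max(models, key=lambda c: models[c].logpdf(x))

rng = np.random.default_rng(5)
classes = ["anger", "surprise", "disgust", "happiness", "sadness", "neutral"]
X = rng.normal(size=(300, 16))                        # EMG feature vectors (placeholder)
y = np.array(classes)[rng.integers(0, 6, size=300)]   # expression labels (placeholder)

models = fit_gaussians(X, y)
print(classify(models, X[0]))
```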
... Data was recorded using a modified version of the UKA {EEG|EMG}-Studio [15], which had originally been designed for recognition of silent speech from EEG and electromyographic signals [16], [17]. The software was extended by the functionality to present the pictures used for emotion induction and synchronize the recording with the picture presentation. ...
Conference Paper
Full-text available
In the field of interaction between humans and robots, emotions have been disregarded for a long time. During the last few years interest in emotion research in this area has been constantly increasing, as giving a robot the ability to react to the emotional state of the user can help to make the interaction more human-like and enhance the acceptance of robots. In this paper we investigate a method to facilitate emotion recognition from electroencephalographic signals. For this purpose we developed a headband to measure electroencephalographic signals on the forehead. Using this headband we collected data from five subjects. To induce emotions we used 90 pictures from the International Affective Picture System (IAPS) belonging to the three categories pleasant, neutral, and unpleasant. For emotion recognition we developed a system based on support vector machines (SVMs). With this system an average recognition rate of 47.11% could be achieved on subject-dependent recognition.
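A minimal sketch of the SVM-based recognition pipeline this abstract describes, using placeholder features rather than the headband recordings; the feature dimensionality and cross-validation setup are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(90, 24))      # one feature vector per IAPS picture (placeholder)
y = rng.integers(0, 3, size=90)    # 0 = pleasant, 1 = neutral, 2 = unpleasant (placeholder)

# Subject-dependent evaluation via cross-validation on one subject's data.
scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5)
print("subject-dependent CV accuracy:", scores.mean())
```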