Figure 7
Tracking results at different frames for sequences with subjects engaged in naturalistic conversation, where time progresses from top to bottom (i.e. the top frame is from the earlier part of the sequence and the bottom frame from the later part of the video sequence). Shown is the performance of the previous linear predictor (LP) tracking method (top) and of the new method (bottom).


Source publication
Article
Full-text available
Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, suffer from unreliability. Here we report the results of a long-term study in automatic lip-reading with the objective of converting video-to-text (V2T). The V2T problem is surprising in that some aspects that look tricky, such a...

Context in source publication

Context 1
... tracking results can be seen in Figure 7, where the results of the previously used linear predictors and the currently proposed method are compared. The new method, based on non-linear regression forests, is far more robust to pose changes than the previous approach. Importantly, the mechanism providing this increase in robustness is not specific to pose changes alone. The comparison was also carried out on a number of long sequences of subjects engaged in naturalistic conversation. The new method is robust to occlusions and pose changes, whilst the linear predictor method suffers catastrophic failure on multiple facial feature points. Any other highly non-linear variation can, in practice, also be modeled and accounted for using the new method. Consequently, in future work we will attempt to quantify the performance of the system under the directional lighting changes that can occur in both indoor and outdoor environments, using exactly the same mechanism. ...
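To make the mechanism concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a non-linear regression forest can replace a linear predictor for feature-point tracking: a forest regresses the landmark displacement from the grey-level appearance of a patch sampled around the previous position. Patch size, forest size and the perturbation scheme here are assumptions.

```python
# Illustrative sketch only: regression-forest displacement prediction for one
# facial landmark, in the spirit of the method described above (not the
# authors' code; patch size, forest size, etc. are assumptions, and landmarks
# are assumed to lie away from the image border).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

PATCH = 15  # patch half-width in pixels (assumed)

def sample_patch(image, x, y, half=PATCH):
    """Return the flattened grey-level patch centred on (x, y)."""
    return image[y - half:y + half + 1, x - half:x + half + 1].ravel().astype(float)

def train_forest(images, landmarks, n_perturb=50, max_shift=10, seed=0):
    """Perturb the ground-truth landmark and learn to predict the displacement
    (dx, dy) back to the true position from the patch appearance."""
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for img, (x, y) in zip(images, landmarks):
        for _ in range(n_perturb):
            dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
            X.append(sample_patch(img, x + dx, y + dy))
            Y.append((-dx, -dy))           # displacement back to the landmark
    forest = RandomForestRegressor(n_estimators=100, max_depth=12)
    forest.fit(np.array(X), np.array(Y))
    return forest

def track(forest, frame, x, y, iterations=3):
    """At each new frame, predict the correcting displacement from the patch
    at the previous landmark position (iterated a few times to converge)."""
    for _ in range(iterations):
        dx, dy = forest.predict(sample_patch(frame, x, y)[None, :])[0]
        x, y = int(round(x + dx)), int(round(y + dy))
    return x, y
```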

Similar publications

Conference Paper
Full-text available
This paper proposes a method of tracking the lips in an audio-visual speech recognition system. The presented method consists of a face detector, a face tracker, a lip detector, a lip tracker, and a word classifier. In speech recognition systems, the audio signal is exposed to a large amount of acoustic noise; therefore, scientists are looking for ways to r...
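As a rough illustration of such a detect-then-track front end (a sketch under assumed parameters, not the cited system), a face detector can provide a bounding box from which a lip region of interest is cropped for the downstream tracker and classifier:

```python
# Illustrative sketch of a face-detect -> lip-ROI front end. Assumptions: an
# OpenCV Haar cascade for the face, and the lip ROI taken as the lower third
# of the face box. This is not the cited system's implementation.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_roi(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                     # no face found
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    # Crude lip region: lower third of the face box (assumption).
    return frame_bgr[y + 2 * h // 3: y + h, x: x + w]
```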

Citations

... In traditional lip-reading methods, the hidden Markov model (HMM) is mostly used to convert visual features into speech units [9][10][11]. In recent years, deep learning-based methods have become the mainstream approach in lip-reading research. ...
Article
Full-text available
Lip-reading has attracted more and more attention in recent years and has wide application prospects in areas such as human–computer interaction, surveillance and security, and audio-visual speech recognition. However, research on lip-reading has progressed slowly because of the difficulty of dealing with the fine spatial features of small-sized images from continuous video frames and the temporal features between images. In this paper, to address the challenges of extracting visual spatial features, extracting temporal features and keeping the model lightweight, we propose a high-precision, highly robust and lightweight lip-reading method, Mini-3DCvT. It combines vision transformers and 3D convolution to extract spatio-temporal features from continuous images, making full use of the properties of convolutions and transformers to effectively extract local and global features; it applies weight transformation and weight distillation in the convolution and transformer structures for model compression; and it then sends the extracted features to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that our method achieves 88.3% and 57.1% recognition accuracy respectively, while effectively reducing model computation and the number of parameters, improving the overall performance of the lip-reading model.
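For orientation, a minimal PyTorch sketch of the kind of 3D-convolution + transformer + BiGRU pipeline described in this abstract might look as follows. Layer sizes, pooling and the classification head are assumptions; this is not the published Mini-3DCvT model.

```python
# Sketch of a 3D-conv front end, transformer encoder and BiGRU word classifier
# (assumed dimensions; illustrative only, not the published model).
import torch
import torch.nn as nn

class ConvTransformerLipReader(nn.Module):
    def __init__(self, num_classes=500, d_model=256):
        super().__init__()
        # 3D convolution over (time, height, width) captures short-range
        # spatio-temporal lip motion (local features).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),
        )
        self.proj = nn.Linear(64 * 4 * 4, d_model)
        # Transformer encoder relates every frame to every other frame
        # (global features).
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Bidirectional GRU for sequence modeling, then a word classifier.
        self.gru = nn.GRU(d_model, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, num_classes)

    def forward(self, x):                       # x: (batch, 1, T, H, W) grey frames
        f = self.frontend(x)                    # (batch, 64, T, 4, 4)
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        f = self.encoder(self.proj(f))          # (batch, T, d_model)
        f, _ = self.gru(f)
        return self.head(f.mean(dim=1))         # clip-level word logits
```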
... Silent speech recognition (SSR) technology provides a solution to the aforementioned challenges because it does not rely on acoustic signals but on other media. Several signal modalities have been applied to realize SSR by capturing the movement of articulatory muscles or extracting neural information, such as electromagnetic articulography [5], ultrasound or optical images of the tongue or lips [6]-[8], the electromyogram (EMG) [9]-[12], and the electroencephalogram [13], [14]. ...
Article
Full-text available
Finer-grained decoding at the phoneme or syllable level is a key technology for continuous recognition of silent speech based on the surface electromyogram (sEMG). This paper aims at developing a novel syllable-level decoding method for continuous silent speech recognition (SSR) using a spatio-temporal end-to-end neural network. In the proposed method, the high-density sEMG (HD-sEMG) was first converted into a series of feature images, and then a spatio-temporal end-to-end neural network was applied to extract discriminative feature representations and to achieve syllable-level decoding. The effectiveness of the proposed method was verified with HD-sEMG data recorded by four 64-channel electrode arrays placed over the facial and laryngeal muscles of fifteen subjects subvocalizing 33 Chinese phrases consisting of 82 syllables. The proposed method outperformed the benchmark methods by achieving the highest phrase classification accuracy (97.17 ± 1.53%, p < 0.05) and a lower character error rate (3.11 ± 1.46%, p < 0.05). This study provides a promising way of decoding sEMG towards SSR, which has great potential applications in instant communication and remote control.
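For intuition, converting a window of multi-channel sEMG into a "feature image" that a spatio-temporal network can consume might look like the sketch below. The grid layout, window length and RMS feature are assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch: turn windows of 64-channel sEMG from one electrode
# array into an 8x8 "feature image" per window using root-mean-square
# amplitude (window length and feature choice are assumptions).
import numpy as np

def semg_to_feature_images(semg, fs=1000, win_ms=50):
    """semg: array of shape (n_samples, 64) for one 8x8 electrode array."""
    win = int(fs * win_ms / 1000)
    n_windows = semg.shape[0] // win
    images = np.empty((n_windows, 8, 8))
    for i in range(n_windows):
        segment = semg[i * win:(i + 1) * win]       # (win, 64)
        rms = np.sqrt((segment ** 2).mean(axis=0))  # one value per channel
        images[i] = rms.reshape(8, 8)               # map channels onto the grid
    return images                                   # (n_windows, 8, 8)
```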
... In traditional lip reading models, lip-movement feature extraction methods include geometry, motion, statistical models, or image transforms [11], [12]. Most methods then utilized hidden Markov models (HMMs) to transform visual features into speech units [13]-[15]. With the development of deep learning, machine lip reading has made significant progress in recent years. ...
Article
Full-text available
Lip reading has received increasing attention in recent years. It judges the content of speech based on the movement of the speaker's lips. The rapid development of deep learning has promoted progress in lip reading. However, because lip reading needs to process information from continuous video frames, it is necessary to consider both the correlation between adjacent images and the correlation between long-distance images. Moreover, lip reading mainly focuses on the subtle changes of the lips and the region around them, so it is necessary to extract subtle features from small-sized images. Therefore, the performance of machine lip reading is generally not high, and research progress has been slow. In order to improve the performance of machine lip reading, we propose a lip reading method based on a 3D convolutional vision transformer (3DCvT), which combines a vision transformer and 3D convolution to extract the spatio-temporal features of continuous images and takes full advantage of the properties of convolutions and transformers to effectively extract local and global features from continuous images. The extracted features are then sent to a bidirectional gated recurrent unit (BiGRU) for sequence modeling. We demonstrated the effectiveness of our method on the large-scale lip reading datasets LRW and LRW-1000 and achieved state-of-the-art performance.
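The abstract's claim that convolutions capture local features while transformers capture global dependencies rests on self-attention mixing every frame with every other frame. A toy numerical sketch of that computation is given below (hypothetical shapes, identity projections for brevity; not the paper's code).

```python
# Toy sketch of the scaled dot-product self-attention that lets a transformer
# relate every frame feature to every other frame (shapes are hypothetical).
import numpy as np

def self_attention(X):
    """X: (T, d) per-frame features; returns (T, d) globally mixed features."""
    d = X.shape[1]
    Q, K, V = X, X, X                      # identity projections for brevity
    scores = Q @ K.T / np.sqrt(d)          # (T, T): every frame vs every frame
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                     # each output is a weighted mixture
                                           # of all input frames

T, d = 29, 256                             # e.g. 29 frames of 256-d features
out = self_attention(np.random.randn(T, d))
```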
... Human pose estimation is not only an important computer vision problem but also plays a vital role in a number of real-world applications, such as action recognition [4], gaming [5], augmented reality [6], pedestrian detection [7], automated lip reading [8], sports scenes [9], medical imaging [10], digital entertainment [11], human-computer interaction [12], video surveillance [13], face recognition [14] and gesture recognition [15]. Various devices have also been built to capture motion and gestures. ...
... Such silent speech interfaces (SSIs) use a range of biosignals other than microphone-recorded audible speech to infer information about speech [1,2]. Examples of such biosignals include permanent-magnetic or electromagnetic articulography [3,4,5], lip reading from video [6], ultrasound imaging of the speech apparatus [7], electroencephalography- or functional near-infrared-spectroscopy-based brain-computer interfaces [8,9] and, as in our work, surface electromyography [10] - the recording of electrical muscle activity using surface electrodes attached to the face. ...
... However, the surprising result is the remarkable resilience that computer lip-reading shows to resolution. Given that modern experiments in lip-reading usually take place with high-resolution video ([16] and [1], for example), the disparity between the measured performance (shown here) and the assumed performance is very striking. Of course, higher resolution may be beneficial for tracking but, in previous work, we have been able to show that other factors believed to be highly detrimental to lip-reading, such as off-axis views [5], actually have the ability to improve performance rather than degrade it. ...
... Of course, higher resolution may be beneficial for tracking but, in previous work, we have been able to show that other factors believed to be highly detrimental to lip-reading, such as off-axis views [5], actually have the ability to improve performance rather than degrade it. We have also noted that the previous shibboleths of outdoor video, poor lighting and agile motion affecting performance can all be overcome [1]. It seems that in lip-reading it is better to trust the data than conventional wisdom. ...
Article
Full-text available
Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as lighting, identity, motion, emotion and expression. But some factors, such as video resolution, are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so that we can measure the effect of video resolution on recognition accuracy. We conclude that, contrary to common practice, resolution need not be that great for automatic lip-reading. However, it is highly unlikely that automatic lip-reading can work reliably when the distance between the bottom of the lower lip and the top of the upper lip is less than four pixels at rest.
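A quick back-of-the-envelope reading of that four-pixel criterion (a hypothetical worked example, not figures taken from the paper): if a talker's resting lip aperture spans some number of pixels at the original resolution, uniform downscaling shrinks it proportionally, and the rule says lip-reading becomes unreliable once it drops below four pixels.

```python
# Hypothetical worked example of the four-pixel rule quoted above: a resting
# lip aperture of 24 pixels in a 1080-line recording, downscaled to smaller
# frame heights (the 24-pixel starting value is an assumption).
aperture_px_at_1080p = 24
for height in (1080, 540, 270, 180, 135):
    aperture = aperture_px_at_1080p * height / 1080
    usable = aperture >= 4          # the paper's reliability threshold
    print(f"{height:4d}p: aperture ~ {aperture:4.1f} px -> {'ok' if usable else 'too small'}")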
... There has been consistent and sustained interest in building computer systems that can understand what humans are saying without hearing the audio channel [1-4]. There are obvious applications for such systems in security but also in noisy environments such as cockpits, battlefields and crowds where audio recognition is likely to be impossible or highly degraded. Early work consisted of very small vocabularies (often fewer than 10 words) [5], single speakers, high-definition video (often the camera would be zoomed into the lip region or the frame rate would be greater than 60 fields per second) [6] and, often, the talker would wear special lipstick to allow easy segmentation and analysis of the lips. ...
Article
Full-text available
In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.
... Three different types of features are generally extracted from the ROI: texture-based features, shape-based features, or a combination of these two [1,16]. Texture-based features are computed directly on the pixel values of the ROI. ...
... Shape-based features attempt to take into account the actual shape of the mouth by extracting the contours or computing geometrical distances between certain points of interest around the mouth [1]. Nowadays, many studies that make use of these types of features directly extract these points and shapes with a face tracker and apply PCA to them, as performed by active appearance models (AAMs) [16]. ...
Conference Paper
Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple-camera setups can be beneficial for visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The thorough analysis covers fusion of all possible view-angle combinations at both the feature level and the decision level. The visual speech recognition system employed in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the complementary information contained in recordings from different view angles improves recognition significantly. For example, the sentence correctness on the test set is increased from 76% for the highest-performing single view (30°) to up to 83% when combining this view with the frontal and 60° view angles.
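A minimal sketch of the decision-level fusion step described here, combining per-view Viterbi log-likelihoods into one decision (view weights, names and hypothesis scores below are assumptions for illustration):

```python
# Illustrative sketch of decision-level fusion: combine (optionally weighted)
# Viterbi path log-likelihoods from several camera views and pick the best
# hypothesis. View names, weights and scores are made-up examples.
import numpy as np

def fuse_views(loglik_per_view, weights=None):
    """loglik_per_view: dict view_name -> {hypothesis: log-likelihood}."""
    views = list(loglik_per_view)
    if weights is None:
        weights = {v: 1.0 / len(views) for v in views}   # equal weights
    hypotheses = loglik_per_view[views[0]].keys()
    fused = {h: sum(weights[v] * loglik_per_view[v][h] for v in views)
             for h in hypotheses}
    return max(fused, key=fused.get), fused

best, scores = fuse_views({
    "frontal": {"hypothesis A": -310.2, "hypothesis B": -305.9},
    "30deg":   {"hypothesis A": -298.4, "hypothesis B": -301.7},
    "60deg":   {"hypothesis A": -322.0, "hypothesis B": -318.5},
})
# best -> the hypothesis with the highest fused log-likelihood
```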
... Lipreading can be used to complement or augment speech recognition, particularly in the presence of noise [3,14], and for purely visual speech recognition [4,15,5]. In the latter case, ambiguities due to incomplete information (e.g. about voicing) can be mitigated by augmenting the video stream with ultrasound images of the vocal tract [16]. ...
... Classification is often done with Hidden Markov Models (HMMs), e.g. [30,15,31,32]. Mouth tracking is done as a preprocessing step [32,15,5]. ...
... [30,15,31,32]. Mouth tracking is done as a preprocessing step [32,15,5]. For a comprehensive review see [33]. ...
Article
We present a lipreading system, i.e. a speech recognition system using only visual features, which uses domain-adversarial training for speaker independence. Domain-adversarial training is integrated into the optimization of a lipreader based on a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural networks, yielding an end-to-end trainable system which only requires a very small number of frames of untranscribed target data to substantially improve the recognition accuracy on the target speaker. On pairs of different source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech data. On multi-speaker training setups, the accuracy improvements are smaller but still substantial.
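Domain-adversarial training of this kind typically hinges on a gradient reversal layer placed between the feature extractor and a speaker (domain) classifier; a compact PyTorch sketch of that building block is given below (an illustration of the general technique, not the authors' code).

```python
# Sketch of a gradient reversal layer for domain-adversarial training: the
# forward pass is the identity, the backward pass flips (and scales) the
# gradient, so the feature extractor learns speaker-invariant features while
# the speaker classifier tries to identify the speaker. (Illustrative only.)
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: features feed both the word classifier and, through
# grad_reverse(features), the speaker classifier; the summed loss makes the
# encoder minimise word error while maximising speaker confusion.
```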