Figure 4
Recovery of lock. The subject moves rapidly (top); using vision alone to track, lock is lost and never recovered. Using sound and vision together (bottom), lock is recovered.

Source publication
Conference Paper
Full-text available
Video telephony could be considerably enhanced by a tracking system that allows the speaker freedom of movement while maintaining a well-framed image for transmission over limited bandwidth. Commercial multi-microphone systems already exist that track speaker direction in order to reject background noise. Stereo sound and vision a...

Context in source publication

Context 1
... particles latch on to prominent features in the background and never recover the subject again. However, in the second experiment, where the visual measurements are combined with the sound measurements, the particle filter is able to immediately reinitialise on the subject as soon as it speaks, and the subsequent tracking is successful, as illustrated by the key frames at the bottom of figure 4. ...
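How this recovery works can be sketched in a few lines: if the audio and visual likelihoods are assumed conditionally independent, the fused particle weight is their product, so an audio likelihood peaked at the speaker's bearing overrides a visual likelihood stuck on background clutter. A minimal illustration in Python (hypothetical names; one common fusion scheme, not necessarily the paper's exact formulation):

    import numpy as np

    def fuse_weights(w_visual, w_audio):
        # Assuming the two modalities are conditionally independent given
        # the state, the joint likelihood is the per-particle product.
        w = w_visual * w_audio
        return w / w.sum()  # normalise to a probability distribution

    def resample(particles, weights):
        # Multinomial resampling: particles latched onto background clutter
        # receive negligible joint weight once the audio likelihood peaks
        # at the speaker's bearing, so they are discarded here.
        rng = np.random.default_rng()
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        return particles[idx]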

Citations

... • Video-based sound source localization (SSL) [5], [8], [20], [60], [65], [66], [85], [94], [95], [97], [103], [110], [131], [138], [139], [148], [151], [153], [162]-[166] involves marking which pixels in a video frame correspond to each sound source, such as vehicles. When the source of sound is a person, we have the audiovisual speaker localization (AVSL) [23], [35] problem, which involves identifying and locating the speaker(s) in an audio-visual scene, for example identifying and locating a person speaking in a video and tracking the speaker [21], [22], [33]. The problem becomes harder when the sound source occupies only a small number of pixels relative to the frame size. ...
... recognition [31], movie genre recognition [31], [32], video summarization [40], tracking [5], [21], [22], [33], video search & retrieval [36]. ...
... In the pre-deep learning era, researchers used various methods to solve the problem of audio-visual (AV) source localization, such as probabilistic models [20], [38] and canonical correlation analysis (CCA) [20], [100]. Speaker tracking tasks [33] also require speaker or source localization to track the speaker effectively. For localization, Vermaak et al. [33] use time delays of arrival. ...
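For a two-microphone array, a time delay of arrival maps to a bearing under a far-field assumption. A minimal sketch (illustrative constants, not the cited paper's configuration):

    import numpy as np

    C = 343.0  # speed of sound in air, m/s
    D = 0.20   # microphone spacing, m (illustrative value)

    def azimuth_from_tdoa(tau):
        # Far-field model: tau = (D / C) * sin(theta), so
        # theta = arcsin(C * tau / D); the clip guards against
        # |argument| > 1 caused by noisy delay estimates.
        return np.degrees(np.arcsin(np.clip(C * tau / D, -1.0, 1.0)))

    print(azimuth_from_tdoa(3e-4))  # ~31 degrees off the broadside axis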
Article
Full-text available
The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.
... due to the presence of acoustic noise. Other modalities, such as thermal vision and laser rangefinders, could also be considered; however, we will focus on audio-visual sensors, as in [7], due to their widespread use, low cost, and easy installation [8]. More specifically, in our work we consider the visual measurements obtained by the CAMShift [9] face detector [10], and the audio measurements, such as the direction of arrival (DOA) of the sources [11]. ...
Article
Full-text available
Sequential Monte Carlo probability hypothesis density (SMC-PHD) filtering is a popular method used recently for audiovisual (AV) multi-speaker tracking. However, due to the weight degeneracy problem, the posterior distribution can be represented poorly by the estimated probability when only a few particles are present around the peak of the likelihood density function. To address this issue, we propose a new framework where particle flow (PF) is used to migrate particles smoothly from the prior to the posterior probability density. We consider both zero and non-zero diffusion particle flows (ZPF/NPF) and develop two new algorithms, AV-ZPF-SMC-PHD and AV-NPF-SMC-PHD, where the speaker states from the previous frames are also considered for particle relocation. The proposed algorithms are compared systematically with several baseline tracking methods using the AV16.3, AVDIAR and CLEAR datasets, and are shown to offer improved tracking accuracy and average effective sample size (ESS).
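To make the idea of migrating particles from prior to posterior concrete, here is a toy one-dimensional sketch of a zero-diffusion (exact) particle flow for a linear-Gaussian model, in the spirit of the Daum-Huang flow that the ZPF variant builds on (illustrative parameters, not the paper's implementation):

    import numpy as np

    # Prior N(m0, P); measurement z = x + v with v ~ N(0, R).
    m0, P, R, z = 0.0, 1.0, 1.0, 1.0
    rng = np.random.default_rng(0)
    x = rng.normal(m0, np.sqrt(P), size=5000)  # particles from the prior

    # Integrate dx/dlam = A(lam) * x + b(lam) in pseudo-time lam in [0, 1].
    steps = 1000
    dlam = 1.0 / steps
    for i in range(steps):
        lam = i * dlam
        A = -0.5 * P / (lam * P + R)  # H = 1 in this scalar model
        b = (1 + 2 * lam * A) * ((1 + lam * A) * P * z / R + A * m0)
        x += dlam * (A * x + b)

    # Exact posterior: mean P*z/(P+R) = 0.5, variance P*R/(P+R) = 0.5.
    print(x.mean(), x.var())  # both ~0.5

Each particle drifts deterministically, so the cloud ends up distributed as the posterior without any resampling; this is what "migrating particles smoothly" refers to above.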
... To address this problem, extra information obtained from other sensors/modalities can be utilised, including colour (RGB) images [5]-[8], audio streams [4], [9]-[11], wireless network and mobile technologies [12], thermal infra-red sensors [13], and recently depth sensors such as stereo cameras [14], laser range finders [15] and RGB-depth (RGB-D) sensors [16]-[19]. Cross-modality tracking has attracted much attention in the last decade, mostly in the audio-visual domain [20]-[25], the RGB-D domain [26]-[29], and the RGB-D and thermal domain [30]. ...
Article
In an object-based spatial audio system, the positions of the audio objects (e.g. speakers) present in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as an essential element to mimic human hearing and to monitor and analyse the scene, including localisation and tracking of multiple speakers. The binaural audio tracker, however, is usually prone to errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method that fuses the binaural audio with depth information (from a depth sensor, e.g. Kinect). More specifically, the PHD filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded either by other objects or due to the limited field of view of the depth sensor. To compensate for mis-detections in the depth stream, a novel gap-filling technique is presented to map audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With our proposed method, both the errors in the binaural tracker and the mis-detections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing mis-detections.
... Speaker tracking with multi-modal information has also gained attention, and many approaches have been proposed in the past decade using audio-visual information [2, 6, 23, 29, 75-81], exploiting the complementary characteristics of each modality. The differences among these existing works arise from the overall objective, such as tracking either single or multiple speakers, and the specific detection/tracking framework. ...
... Unlike [2, 80], particle filters are used in Ref. [81] to estimate the predictions from audio- and video-based measurements, and audio-visual information fusion is performed at the feature level. In other words, the independent particle coordinates from the features of both modalities are fused for speaker tracking. ...
... These works [2, 23, 80, 81] have focused on the single-speaker case, which cannot directly address the tracking problem for multiple speakers. ...
Chapter
Full-text available
Target motion tracking has found application in interdisciplinary fields of computer vision and signal processing, including but not limited to surveillance and security, forensic science, intelligent transportation systems, driving assistance, monitoring of prohibited areas, medical science, robotics, action and expression recognition, individual speaker discrimination in multi-speaker environments, and video conferencing. Among these applications, speaker tracking in enclosed spaces has been gaining relevance due to widespread advances in devices and technologies and the need for seamless solutions for real-time tracking and localization of speakers. However, speaker tracking is a challenging task in real-life scenarios, as several distinctive issues influence the tracking process, such as occlusions and an unknown number of speakers. One approach to overcoming these issues is to use multi-modal information, as it conveys complementary information about the state of the speakers compared to single-modal tracking. Several approaches to using multi-modal information have been proposed, which can be classified into two categories, namely deterministic and stochastic. This chapter aims at providing multimedia researchers with a state-of-the-art overview of tracking methods used for combining multiple modalities to accomplish various multimedia analysis tasks, classifying them into different categories and listing new and future trends in this field.
... There are many applications in which audio and video are combined, such as speech recognition [7]-[13], speaker recognition [14]-[16], biometric verification [17]-[21], event detection [22]-[25], person or object tracking [26]-[31], localization and tracking of an active speaker [32, 33], music content analysis, emotion recognition, video search, human-machine interaction, voice activity detection, and audio source separation [34]-[36]. Clearly, some applications make use of face images, and sometimes even whole-body movements rather than the face alone. ...
Article
Full-text available
The paper provides an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual speech features. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of the different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.
... There is a consensus that different modalities are complementary to each other, which has motivated an increasing interest in cross-modal tracking in the last decade. Most of this work has been done in the audio-visual domain [24, 8, 9]. Combining other modalities has recently become more popular. ...
... Statistical optimality has always been a dominant approach to sensor fusion because it provides an inherent link to information theory, which is the mathematically intuitive approach for combining noisy data originating from multiple sensors. Source detection and localization have received a lot of attention in the last decade, and there are a variety of unisensory [5]-[7] as well as multisensory [8]-[10] approaches. Measurements are processed between the execution of actions in order to avoid noisy measurements resulting from motor sounds and moving sensors. The recorded stereo audio data is then transformed into a time-frequency representation, which is used for sound source localization by estimating interaural temporal differences (ITDs). ...
Article
Full-text available
We present a system for sensorimotor audio-visual source localization on a mobile robot. We utilize a particle filter for the combination of audio-visual information and for the temporal integration of consecutive measurements. Although the system only measures the current direction of the source, the position of the source can be estimated because the robot is able to move and can therefore obtain measurements from different directions. These actions by the robot successively reduce uncertainty about the source’s position. An information gain mechanism is used for selecting the most informative actions in order to minimize the number of actions required to achieve accurate and precise position estimates in azimuth and distance. We show that this mechanism is an efficient solution to the action selection problem for source localization, and that it is able to produce precise position estimates despite simplified unisensory preprocessing. Because of the robot’s mobility, this approach is suitable for use in complex and cluttered environments. We present qualitative and quantitative results of the system’s performance and discuss possible areas of application.
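The reason motion resolves distance is classic bearing-only triangulation: two direction measurements taken from different robot positions intersect at the source. A minimal noiseless sketch (hypothetical helper, not the paper's estimator, which uses a particle filter over accumulated bearings):

    import numpy as np

    def triangulate(p1, theta1, p2, theta2):
        # Each bearing defines a ray p_i + t_i * [cos(theta_i), sin(theta_i)];
        # solve p1 + t1*d1 = p2 + t2*d2 for (t1, t2).
        d1 = np.array([np.cos(theta1), np.sin(theta1)])
        d2 = np.array([np.cos(theta2), np.sin(theta2)])
        rhs = np.asarray(p2, float) - np.asarray(p1, float)
        t = np.linalg.solve(np.column_stack([d1, -d2]), rhs)
        return np.asarray(p1, float) + t[0] * d1

    # Source at (2, 1): bearings taken from the origin and from (1, 0).
    print(triangulate([0, 0], np.arctan2(1, 2), [1, 0], np.arctan2(1, 1)))  # -> [2. 1.]

In the particle-filter setting, each new bearing reweights the position hypotheses, and the information-gain mechanism picks the next motion whose expected bearing best disambiguates the remaining particle cloud.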
... In comparison to the KF and EKF approaches, the PF approach is more robust for nonlinear models as it can approach the Bayesian optimal estimate with a sufficiently large number of particles [15]. It has been widely employed for speaker tracking problems [16], [17], [18]. For example, in [16] and [17], PF is used to fuse object shapes and audio information. ...
... In [18], independent audio and video observation models are fused for simultaneous tracking and detection of multiple speakers. ...
... Data association algorithms with Bayesian methods and the PHD filter in target tracking applications can be found in [7], [29], [30], [31] and [32]. However, some researchers found that classical data association algorithms are computationally expensive, which led them to fuse multi-modal measurements inside their proposed frameworks [11], [14], [16], [17], [20], as we also do here. ...
... A common approach is to characterize the human movement dynamics as a Langevin process [25], since it is reasonably simple and has been proven to work well in practical applications [19, 25]. In this case, the state variable, x_k, is defined as: ...
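The excerpt truncates before the definition. In the Langevin model standard in this line of work, the state stacks image-plane position and velocity; a sketch of the usual discretised form (the citing paper's exact notation may differ) is

    \mathbf{x}_k = (x_k,\, y_k,\, \dot{x}_k,\, \dot{y}_k)^\top, \qquad
    \dot{x}_k = a\,\dot{x}_{k-1} + b\,u_k, \qquad
    x_k = x_{k-1} + \tau\,\dot{x}_k,

with a = e^{-\beta\tau}, b = \bar{v}\sqrt{1 - a^2} and u_k \sim \mathcal{N}(0, 1), where \tau is the frame interval, \beta the rate constant, and \bar{v} the steady-state root-mean-square velocity; the y components evolve analogously.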
Article
Full-text available
This paper addresses the problem of three-dimensional speaker orientation estimation in a smart-room environment equipped with microphone arrays. A Bayesian approach is proposed to jointly track the location and orientation of an active speaker. The main motivation is that knowledge of the speaker orientation may yield increased localization performance and vice versa. Assuming that the sound produced by the speaker originates from the mouth, the center of the head is deduced from the estimated head orientation. Moreover, the elevation angle of the speaker's head can be partly inferred from the fast vertical movements of the computed mouth location. In order to test the performance of the proposed algorithm, a new multimodal dataset has been recorded for this purpose, where the corresponding 3D orientation angles are acquired by an inertial measurement unit (IMU) comprising accelerometers, magnetometers and gyroscopes on three axes. The proposed joint algorithm outperforms a two-step approach in terms of localization and orientation angle precision, confirming the superiority of the joint approach.
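A minimal sketch of the mouth-to-head-center geometry this joint model relies on (illustrative offset and hypothetical names, not the paper's parameterisation):

    import numpy as np

    HEAD_OFFSET = 0.09  # m, rough mouth-to-head-center distance (assumed)

    def head_center_from_mouth(mouth, azimuth, elevation):
        # The mouth lies on the front of the head, so the center sits one
        # offset behind it along the facing direction given by the
        # estimated (azimuth, elevation) orientation.
        facing = np.array([
            np.cos(elevation) * np.cos(azimuth),
            np.cos(elevation) * np.sin(azimuth),
            np.sin(elevation),
        ])
        return np.asarray(mouth, float) - HEAD_OFFSET * facing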
... Audiovisual (AV) scene analysis has become an increasingly popular research topic in recent years due to many useful applications: human-robot interaction [1], multimodal interfaces [2], audio-visual tracking [3], [4], object localization [5], etc. Various attempts to build computational paradigms for AV scene analysis consider the issue of integration as the cornerstone of the approach. A popular association principle for auditory and visual data found in the literature is co-localization [2], [6], [7], meaning that observations from different modalities are fused together as if they were generated from the same spatial source. ...
... We note that there are audio-visual sensor configurations, e.g., one camera and an array of microphones, that do not need full spatial calibration. One can estimate the two-dimensional relationship between the image position of a visual feature and an auditory event by mapping sounds onto the image plane [2], [6], or by using a rough estimate of microphone locations relative to the camera [3]. Alternatively one can estimate a calibration function that maps the two-dimensional image coordinates of a visual event to the one-dimensional audio angle of arrival in a linear microphone array [13]. ...
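One simple instance of such a calibration function is a pinhole mapping from image column to horizontal bearing (illustrative intrinsics, hypothetical names):

    import numpy as np

    FX, CX = 800.0, 320.0  # focal length and principal point in pixels (assumed)

    def pixel_to_azimuth(u):
        # Column u maps to a bearing relative to the optical axis; this is
        # the 2D-image-to-1D-DOA calibration described above.
        return np.degrees(np.arctan2(u - CX, FX))

    print(pixel_to_azimuth(480.0))  # ~11.3 degrees right of center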
Conference Paper
Full-text available
In this paper we address the problem of aligning visual (V) and auditory (A) data using a sensor composed of a camera-pair and a microphone-pair. The original contribution of the paper is a method for AV data alignment through estimation of the 3D positions of the microphones in the visual-centred coordinate frame defined by the stereo camera-pair. We exploit the fact that these two distinct data sets are conditioned by a common set of parameters, namely the (unknown) 3D trajectory of an AV object, and derive an EM-like algorithm that alternates between estimation of the microphone-pair position and estimation of the AV object trajectory. The proposed algorithm has a number of built-in features: it can deal with A and V observations that are misaligned in time, it estimates the reliability of the data, it is robust to outliers in both modalities, and it has proven theoretical convergence. We report experiments with both simulated and real data.