Saqlain Hussain Shah's research while affiliated with University of Engineering and Technology, Taxila and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically generated by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

Publications (2)


Fig. 2. (a) Independent modality-specific embedding networks are leveraged for off-the-shelf feature extraction (box). The proposed two-branch model uses independent modality-specific FC layers; element-wise multiplication fuses the two branches. (b) During the testing phase, only audio data is used and the visual input is set to 0. Features of audio samples from the training and testing splits are extracted, and an SVM is then trained on these features to report % accuracy.
Speaker Recognition in Realistic Scenario Using Multimodal Data
  • Preprint
  • File available

February 2023 · 54 Reads

Saqlain Hussain Shah


In recent years, an association has been established between the faces and voices of celebrities by leveraging large-scale audio-visual information from YouTube. The availability of large-scale audio-visual datasets is instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. The aim of this paper is therefore to leverage large-scale audio-visual information to improve the speaker recognition task. To this end, we propose a two-branch network to learn joint representations of faces and voices in a multimodal system. Features are then extracted from the two-branch network to train a classifier for speaker recognition. We evaluate our proposed framework on a large-scale audio-visual dataset named VoxCeleb1. Our results show that the addition of facial information improves the performance of speaker recognition. Moreover, our results indicate that there is an overlap between face and voice.
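The caption and abstract describe a two-branch model in which modality-specific fully connected layers are fused by element-wise multiplication, and an audio-only test protocol in which the visual input is set to 0, embeddings are extracted, and an SVM is trained on them. The sketch below illustrates that pipeline with PyTorch and scikit-learn under stated assumptions: the feature dimensions, layer sizes, speaker count, and the random placeholder features and labels are illustrative and are not the authors' implementation.

import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

class TwoBranchModel(nn.Module):
    # Modality-specific FC branches fused by element-wise multiplication.
    def __init__(self, audio_dim=512, face_dim=512, embed_dim=256, num_speakers=1251):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU())
        self.face_branch = nn.Sequential(nn.Linear(face_dim, embed_dim), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, audio_feat, face_feat):
        a = self.audio_branch(audio_feat)   # audio branch embedding
        v = self.face_branch(face_feat)     # face branch embedding
        fused = a * v                       # element-wise multiplication fusion
        return self.classifier(fused), fused

# Audio-only test protocol (panel b): the visual input is set to 0.
def extract_audio_only(model, audio_feats, face_dim=512):
    with torch.no_grad():
        zeros = torch.zeros(audio_feats.size(0), face_dim)
        _, emb = model(audio_feats, zeros)
    return emb.numpy()

model = TwoBranchModel()
model.eval()

# Hypothetical pre-extracted audio features and speaker labels (placeholders).
train_audio = torch.randn(100, 512); train_labels = torch.randint(0, 10, (100,))
test_audio = torch.randn(20, 512);   test_labels = torch.randint(0, 10, (20,))

svm = SVC(kernel="linear")
svm.fit(extract_audio_only(model, train_audio), train_labels.numpy())
pred = svm.predict(extract_audio_only(model, test_audio))
print("Speaker ID accuracy: {:.2%}".format(accuracy_score(test_labels.numpy(), pred)))

In this sketch, zeroing the face input at test time means the face branch contributes only its bias activations, so the fused embedding is driven by the audio branch, mirroring the audio-only evaluation described in panel (b).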


Citations (1)


... Moreover, these datasets provide synchronized visual information, which is instrumental in bringing face-voice (F-V) association tasks to the vision community by learning joint representations for multimodal applications such as cross-modal verification, matching, and retrieval [2, 3, 10-12]. Consequently, the F-V association task has gained significant research interest [10-18]. Moreover, it has fueled the creation of new F-V association datasets to study this task. ...

Reference:

Multimodal pre-train then transfer learning approach for speaker recognition
Speaker Recognition in Realistic Scenario Using Multimodal Data
  • Citing Conference Paper
  • February 2023