The structure of the Transformer Encoder employed to extract temporal information from sequences. After the data are processed, they are fed to the Transformer Encoder to extract temporal information. A single Transformer Encoder layer is composed of a Multi-Head Attention module and a Feed-Forward module, with an external Positional Encoding module.

Source publication
Article
Full-text available
Depression is a severe psychological condition that affects millions of people worldwide. As depression has received more attention in recent years, it has become imperative to develop automatic methods for detecting depression. Although numerous machine learning methods have been proposed for estimating the levels of depression via audio, visual,...

Context in source publication

Context 1
... Transformer Encoder structure employed in this work has been detailed in Figure 2. Following [8], we use the naive Transformer Encoder structure, along with the Positional Encoding module, Multi-head Attention module, and Feed-Forward module in our work. ...
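To make the described structure concrete, below is a minimal PyTorch sketch of one such encoder layer: an external sinusoidal Positional Encoding followed by Multi-Head Attention and a Feed-Forward block. The dimensions, layer count, and input shapes are illustrative assumptions, not the configuration used in the source publication.

```python
# Minimal sketch of a naive Transformer Encoder with external sinusoidal
# positional encoding (all sizes below are illustrative assumptions).
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):                             # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]

# One encoder layer = Multi-Head Attention + Feed-Forward, as in Figure 2.
d_model, n_heads, seq_len = 128, 4, 100               # assumed sizes
pos_enc = PositionalEncoding(d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=512, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

frames = torch.randn(8, seq_len, d_model)             # e.g. projected audio/visual frames
temporal_features = encoder(pos_enc(frames))          # (8, seq_len, d_model)
```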

Similar publications

Article
Full-text available
In this paper, a general regression neural network (GRNN) with the input feature of Mel-frequency cepstrum coefficients (MFCC) is employed to automatically recognize the calls of leopard, Ross, and Weddell seals with widely overlapping living areas. As a feedforward network, GRNN has only one network parameter, i.e., the spread factor. The recognition per...

Citations

... We aim to further fuse spatio-temporal information from unimodal data in social network data to detect depression. In contrast to previous methods [9], [17], [18], [19], we propose an end-to-end lightweight Multimodal Spatio-Temporal Attention Transformer approach for Depression detection (named: DepMSTAT). The proposed Spatio-Temporal Attentional Transformer (STAT) module can be used to extract and fuse temporal and spatial information from facial and speech data. ...
... humour detection [35], product detection [36], data alignment in time series modelling of human behaviour [37], dance choreography [38], and so on. The transformer encoder was first applied in [17] to extract long-term temporal context information from long sequences of audio and visual data on depression. Guo et al. [18] proposed a multimodal depression detection method based on the TOpic Attentive Transformer (TOAT) by introducing a transformer pre-training model. ...
... For multimodal data fusion studies, gated recurrent unit (GRU) is used to extract features from text, speech and video data to identify depression [16]. Recently, some studies have introduced transformer encoders to depression recognition tasks [17], [18]. In our previous work [9], by embedding the spatio-temporal features in the matrix V, the attention mechanism was guided to learn the global information efficiently. ...
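Where the snippet above mentions a GRU for per-modality feature extraction, a minimal sketch of summarizing a frame-level sequence into an utterance-level vector might look as follows; all sizes are hypothetical, not those of the cited work.

```python
# Sketch: a GRU summarizing a frame-level modality sequence into one
# utterance-level feature vector (sizes are illustrative assumptions).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=40, hidden_size=64, num_layers=1, batch_first=True)

speech_frames = torch.randn(8, 200, 40)    # (batch, frames, per-frame features)
outputs, last_hidden = gru(speech_frames)  # outputs: (8, 200, 64)
utterance_feature = last_hidden[-1]        # (8, 64) summary used downstream
```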
Article
Full-text available
Depression is one of the most common mental illnesses, but few of the currently proposed deep models based on social media data take into account both temporal and spatial information in the data for the detection of depression. In this paper, we present an efficient, low-covariance multimodal integrated spatio-temporal transformer framework called DepMSTAT, which aims to detect depression using acoustic and visual features in social media data. The framework consists of four modules: a data preprocessing module, a token generation module, a Spatial-Temporal Attentional Transformer (STAT) module, and a depression classifier module. To efficiently capture spatial and temporal correlations in multimodal social media depression data, a plug-and-play STAT module is proposed. The module is capable of extracting unimodal spatio-temporal features and fusing unimodal information, playing a key role in the analysis of acoustic and visual features in social media data. Through extensive experiments on a depression database (D-Vlog), the method in this paper shows high accuracy (71.53%) in depression detection, achieving a performance that exceeds most models. This work provides a scaffold for studies based on multimodal data that assists in the detection of depression.
... EEG is a technique that records the electrical activity of the brain using metal electrodes placed on the scalp. The resulting signals are classified into different frequency bands, including Alpha (8-12 Hz), Beta (13-30 Hz), Gamma (> 30 Hz), Theta (4-8 Hz), and Delta (< 4 Hz). Alpha and beta are linked to stress or negative emotions. ...
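To make the band definitions above concrete, the sketch below splits a single EEG channel into these standard bands with Butterworth band-pass filters; the sampling rate, filter order, and the upper edge used for Gamma are assumptions, not values from the cited work.

```python
# Sketch: isolate standard EEG bands from one channel with Butterworth filters.
# Sampling rate, filter order, and the Gamma upper edge are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256                      # assumed sampling rate in Hz
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (13, 30), "gamma": (30, 45)}

def band_filter(signal, low, high, fs=FS, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)

eeg = np.random.randn(10 * FS)   # 10 s of synthetic EEG for illustration
band_signals = {name: band_filter(eeg, lo, hi) for name, (lo, hi) in BANDS.items()}
```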
Article
Full-text available
The human body responds to challenging events or demanding conditions by entering an escalated psycho-physiological state known as stress. Psychological stress can be captured through physiological signals (ECG, EEG, EDA, GSR, etc.) and behavioral signals (eye gaze, head movement, pupil diameter, etc.). Multiple stressors occurring simultaneously can cause adverse mental and physical health effects, leading to chronic health issues. Continuous monitoring of stress using wearable devices promises real-time data collection, enabling early detection and prevention of stress-related issues. The present study provides an extensive overview of stress detection techniques utilizing wearable sensors and machine and deep learning methods. Various types of machine learning algorithms, such as logistic regression, support vector machines, random forest, and deep learning models, have been applied to analyze data and detect stress. Additionally, the paper highlights the stress detection datasets and research conducted in different areas. The challenges and obstacles related to detecting stress using machine intelligence are also addressed in the paper.
... Studies [5]-[7] have been undertaken to facilitate therapists in their diagnostic process by proposing methods for the automatic assessment of depression levels through multimodal features, including visual cues, acoustic signals, body movements, etc. Liu et al. used a GRU-based model to extract temporal information from audio and visual modalities and employed average late fusion to combine them [8]. Sun et al. used the Transformer encoder to capture long-term sequences and proposed an adaptive late fusion strategy for combining results from different modalities [9]. Fang et al. proposed a multimodal fusion model with multi-level attention mechanisms to integrate audio, visual, and text modalities in order to simultaneously learn depression features [10]. ...
... Rodrigues et al. [18] proposed a multimodal fusion approach using BERT-CNN and Gated CNN representations. Liu et al. [8] employed a GRU model to capture aural and visual features and fused them using an average late fusion method while Sun et al. [9] proposed a transformer-based adaptive late fusion method. Recently, Wu et al. incorporated emotion as an additional modality, along with aural and text modalities, to enhance the robustness of depression detection [14]. ...
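Average late fusion, as referenced in the two snippets above, simply averages per-modality predictions; a weighted combination is one simple stand-in for an adaptive variant. The sketch below uses synthetic scores and arbitrary weights, not values or the exact mechanism from the cited works.

```python
# Sketch: average late fusion of per-modality depression scores, plus a
# weighted variant as a stand-in for adaptive fusion (all values synthetic).
import numpy as np

audio_pred  = np.array([0.62, 0.10, 0.85])   # per-subject scores from an audio model
visual_pred = np.array([0.55, 0.30, 0.75])   # per-subject scores from a visual model

average_fusion = (audio_pred + visual_pred) / 2

weights = np.array([0.6, 0.4])               # e.g. weights tuned on validation data
adaptive_fusion = weights[0] * audio_pred + weights[1] * visual_pred
```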
... The baseline involves mixed training on data from three different emotional stimuli, culminating in modality fusion via Intra fusion, without utilizing inter-emotional stimulus fusion, which is akin to conventional interview experiments. Inspired by the work of Sun et al. [9], we selected a subset of features that demonstrate good performance in unimodal experiments and were less likely to have side effects or slow down the training. ...
Article
Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and the results outperform state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE) and accuracy.
... For instance, the acoustic modality incorporates MFCC, eGeMaps, alongside deep features derived from VGG [33] and DenseNet [34]. Past investigations [22] by Hao et al. highlighted the discriminative power of MFCC and AU-poses in acoustic and visual modalities respectively. Hence, for streamlined and efficient analysis, we solely utilize MFCC and AU-poses features for depression detection. ...
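As a concrete illustration of the MFCC features mentioned above, a minimal extraction sketch with librosa could look as follows; the file name, sampling rate, number of coefficients, and hop length are assumptions rather than the cited configuration.

```python
# Sketch: frame-level MFCC extraction with librosa.
# The file name, sampling rate, n_mfcc, and hop_length are assumptions.
import librosa

audio, sr = librosa.load("interview.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=512)
print(mfcc.shape)   # (13, n_frames): one 13-dim MFCC vector per frame
```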
Article
Full-text available
Multimodal sentiment analysis (MSA) is an emerging field focused on interpreting complex human emotions and expressions by integrating various data types, including text, audio, and visuals. Addressing the challenges in this area, we introduce SentDep, a groundbreaking framework that merges cutting-edge fusion methods with modern deep learning structures. Designed to effectively blend the unique features of textual, acoustic, and visual data, SentDep offers a unified and potent representation of multimodal data. Our extensive tests on renowned datasets like CMU-MOSI and CMU-MOSEI demonstrate that SentDep surpasses current leading models, setting a new standard in MSA performance. We conducted thorough ablation studies and supplementary experiments to identify what drives SentDep’s success. These studies highlight the importance of the size of pre-training data, the effectiveness of various fusion techniques, and the critical role of temporal information in enhancing the model’s capabilities.
... They succeed in obtaining a concordance correlation coefficient (CCC) of 0.67 using a multimodal model of three modalities. With a CCC of 0.733, the multi-transformer model is also applied to the E-DAIC data set [19]. The inputs of the multi-transformer model are voice and facial features. ...
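The concordance correlation coefficient (CCC) quoted in these results measures agreement between predicted and reference depression scores; a minimal sketch of the standard formula applied to synthetic arrays is shown below.

```python
# Sketch: concordance correlation coefficient (CCC), the agreement metric
# reported in these results. The arrays below are synthetic illustrations.
import numpy as np

def ccc(y_true, y_pred):
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

y_true = np.array([4.0, 10.0, 15.0, 7.0])   # e.g. reference depression scores
y_pred = np.array([5.0,  9.0, 14.0, 9.0])   # e.g. model predictions
print(round(ccc(y_true, y_pred), 3))
```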
... We intend to ensure that our Window Block LSTM performs well on other data sets. Therefore, we tested our model with additional data, which is shown in Appendix A. We experimented with Bi-LSTM [18] and transformer models [19] in our data set to evaluate our model using solely facial features. Compared to baseline, experimental versions of the Bi-LSTM and Window Block LSTM models perform better, as shown in Table 6. ...
Article
Full-text available
Machine learning is used for a fast pre-diagnosis approach to prevent the effects of Major Depressive Disorder (MDD). The objective of this research is to detect depression using a set of important facial features extracted from interview video, e.g., radians, gaze at angles, action unit intensity, etc. The model is based on LSTM with an attention mechanism. It aims to combine those features using the intermediate fusion approach. Label smoothing was introduced to further improve the model's performance. Unlike other black-box models, the integrated gradient was presented as the model explanation to show important features of each patient. The experiment was conducted on 474 video samples collected at Chulalongkorn University. The data set was divided into 134 depressed and 340 non-depressed categories. The results showed that our model is the winner, with an 88.89% F1-score, 87.03% recall, 91.67% accuracy, and 91.40% precision. Moreover, the model can capture important features of depression, including head turning, no specific gaze, slow eye movement, no smiles, frowning, grumbling, and scowling, which express a lack of concentration, social disinterest, and negative feelings that are consistent with the assumptions in the depressive theories.
... These techniques have captured the attention of researchers due to their ability to learn from and make predictions based on data (Ma et al., 2016; Trotzek et al., 2017; Dinkel et al., 2017). More recently, the incorporation of multi-modal approaches (Yin et al., 2019; Rodrigues et al., 2019; Ray et al., 2019; Sun et al., 2021) has marked a significant breakthrough in the field. These approaches amalgamate multiple types of data, such as audio, text, and video features, to provide a more holistic understanding of the subject. ...
Conference Paper
Early detection of depression is crucial for the effective treatment of patients. Given the inefficiency of current depression screening methods, exploring depression recognition techniques remains a complex problem with significant implications. This study introduces a novel audio- and text-based deep neural network model for detecting depression, utilizing data from 189 subjects in the DAIC-WOZ corpus. Notably, this study presents a deep learning model (CLAM model) that incorporates CNN, LSTM, and attention layers. This model significantly outperforms other established models in both unimodal and multimodal domains, enhancing the accuracy of depression recognition. Our comparative analysis underscores the superiority of multi-task representation learning over single-task approaches. The model achieved a notable accuracy score of 0.77 and an F1-score of 0.69 on the dataset. The contribution is twofold. Firstly, it proposes and validates the robustness of a deep neural network with CNN, LSTM, and attention layers, offering innovative experimental perspectives for explainable research exploration. Secondly, it introduces an efficient multimodal fusion model for depression recognition, establishing a robust foundation for the subsequent integration of additional modalities in the future.
... World Health Organization (WHO) has stated that there are more than 300 million people affected by MDD disorders [4]. There are three factors that influence MDD disorders, namely psychological, genetic, and socio-environmental factors [5]. MDD can interfere with a person's mental health and have risks that can lead to loss of life. ...
Conference Paper
Major Depressive Disorder (MDD) is a prevalent mental disorder, affecting a significant number of individuals, with estimates reaching 300 million cases worldwide. Currently, the diagnosis of this condition relies heavily on subjective assessments based on the experience of medical professionals. Therefore, researchers have turned to deep learning models to explore the detection of depression. The objective of this review is to gather information on detecting depression based on facial expressions in videos using deep learning techniques. Overall, this research found that RNN models achieved 7.22 MAE for AVEC2014. LSTM models produced 4.83 MAE for DAIC-WOZ, while GRU models achieved an accuracy of 89.77% for DAIC-WOZ. Features like Facial Action Units (FAU), eye gaze, and landmarks show great potential and need to be further analyzed to improve results. Analysis can include applying feature engineering techniques. Aggregation methods, such as mean calculation, are recommended as effective approaches for data processing. This Systematic Literature Review found that facial expressions do have relevant patterns related to MDD.
... In recent years, transformer-based methods have played an irreplaceable role in depression detection research. However, most transformer-based methods do not take into account the spatio-temporal information in the data stream and the fusion of information between different modalities, which limits the capture of high-level semantic information in the data stream [22,23,24]. For example, although the correlation between key points in a sequence of facial landmarks is obvious, this spatial information is ignored in many studies compared to the temporal information. ...
... Cross-transformer modeling fuses single-modality streams with one another by obtaining the query (Q), key (K), and value (V) from different single-modality data. In order to capture the long-term contextual information in long-sequence audio/video of depression, a transformer encoder was applied to the depression detection task for the first time with excellent results [22]. Guo et al. [24] designed a two-branch transformer structure, the Topic-Attentive Transformer (TOAT), for depression detection. ...
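As an illustration of drawing Q from one modality and K/V from another, the sketch below applies cross-modal attention between hypothetical visual and acoustic token sequences; the feature sizes are assumptions and this is not the cited architecture.

```python
# Sketch: cross-modal attention where the query comes from the visual stream
# and the key/value come from the acoustic stream (sizes are assumptions).
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                    batch_first=True)

visual_tokens   = torch.randn(8, 60, d_model)    # Q: facial landmark/AU sequence
acoustic_tokens = torch.randn(8, 200, d_model)   # K, V: audio frame sequence

fused, attn_weights = cross_attn(query=visual_tokens,
                                 key=acoustic_tokens,
                                 value=acoustic_tokens)
# fused: (8, 60, d_model) visual tokens enriched with acoustic context
```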
... In [47], an auditory saliency mechanism was studied and applied in a Transformer to adjust the feature embeddings. Additionally, Transformer has been employed for Alzheimer's disease (AD) detection [48], [49] and depression classification [50]. For example, Ilias et al. [48] utilized a pretrained vision Transformer [27] to extract acoustic features and achieved remarkable results for dementia detection. ...
... Both [48] and [49] encouraged the use of pretrained models. In [50], a Transformer-based network was utilized to extract long-term temporal context information for depression estimation. Based on Transformer, various selfsupervised speech representation learning approaches have also been proposed, including wav2vec [51], wav2vec 2.0 [52] and HuBERT [53]. ...
Preprint
Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses. Recently, Transformer has achieved remarkable success in the natural language processing field and has demonstrated its adaptation to speech. However, previous works on Transformer in the speech field have not incorporated the properties of speech, leaving the full potential of Transformer unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to the state-of-the-art approaches.
... The existing multimodal learning works either focus on improving feature extraction for different modalities or on designing better fusion strategies. To extract features from long sequential data, classical models such as LSTM and CNN are adopted in [1,2,3], while more recent approaches employ the transformer to alleviate the forgetting problem caused by the long duration of patient interviews [4]. In regard to modality fusion, in addition to simple concatenation, an attention mechanism is applied to adjust the modality contribution [5]. ...
... If not specified, all modalities are utilized.

Model | CCC | RMSE | MAE
Baseline [1] | 0.111 | 6.37 | /
Adaptive Transformer [4] | 0.331 | / | 6.22
Bert-CNN & Gated-CNN [2] | 0.403 | 6.11 | /
VarFilter+Midimax (Ours) (A) | 0.434 | 5.44 | 4.4
Hierarchical BiLSTM [13] | 0.442 | 5.50 | /
DepressNet [5] | 0.457 | 5.36 | /
multi-DAAE+Fusion [11] | 0.528 | 4.47 | /
PV-DM [11] ...
Preprint
Full-text available
Depression is a leading cause of death worldwide, and the diagnosis of depression is nontrivial. Multimodal learning is a popular solution for the automatic diagnosis of depression, and existing works suffer from two main drawbacks: 1) the high-order interactions between different modalities cannot be well exploited; and 2) the interpretability of the models is weak. To remedy these drawbacks, we propose a multimodal multi-order factor fusion (MMFF) method. Our method can well exploit the high-order interactions between different modalities by extracting and assembling modality factors under the guidance of a shared latent proxy. We conduct extensive experiments on two recent and popular datasets, E-DAIC-WOZ and CMDC, and the results show that our method achieves significantly better performance compared with other existing approaches. Besides, by analyzing the process of factor assembly, our model can intuitively show the contribution of each factor. This helps us understand the fusion mechanism.