The structure of the Transformer Encoder employed to extract temporal information from sequences. After the data are processed, they are fed to the Transformer Encoder to extract temporal information. A single Transformer Encoder layer is composed of a Multi-Head Attention module and a Feed-Forward module, with an external Positional Encoding module.

Source publication
Article
Full-text available
Depression is a severe psychological condition that affects millions of people worldwide. As depression has received more attention in recent years, it has become imperative to develop automatic methods for detecting depression. Although numerous machine learning methods have been proposed for estimating the levels of depression via audio, visual,...

Context in source publication

Context 1
... Transformer Encoder structure employed in this work has been detailed in Figure 2. Following [8], we use the naive Transformer Encoder structure, along with the Positional Encoding module, Multi-head Attention module, and Feed-Forward module in our work. ...
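To make the described structure concrete, below is a minimal PyTorch sketch of one such encoder layer: an external sinusoidal Positional Encoding followed by Multi-Head Attention and a Feed-Forward block. The dimensions, layer count, and input shapes are illustrative assumptions, not the configuration used in the source publication.

```python
# Minimal sketch of a naive Transformer Encoder with external sinusoidal
# positional encoding (all sizes below are illustrative assumptions).
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):                             # x: (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]

# One encoder layer = Multi-Head Attention + Feed-Forward, as in Figure 2.
d_model, n_heads, seq_len = 128, 4, 100               # assumed sizes
pos_enc = PositionalEncoding(d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=512, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

frames = torch.randn(8, seq_len, d_model)             # e.g. projected audio/visual frames
temporal_features = encoder(pos_enc(frames))          # (8, seq_len, d_model)
```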

Similar publications

Article
Full-text available
In this paper, a general regression neural network (GRNN) with the input feature of Mel-frequency cepstrum coefficients (MFCC) is employed to automatically recognize the calls of leopard, Ross, and Weddell seals with widely overlapping living areas. As a feedforward network, GRNN has only one network parameter, i.e., the spread factor. The recognition per...

Citations

... We aim to further fuse spatio-temporal information from unimodal data in social network data to detect depression. In contrast to previous methods [9], [17], [18], [19], we propose an end-to-end lightweight Multimodal Spatio-Temporal Attention Transformer approach for Depression detection (named: DepMSTAT). The proposed Spatio-Temporal Attentional Transformer (STAT) module can be used to extract and fuse temporal and spatial information from facial and speech data. ...
... humour detection [35], product detection [36], data alignment in time series modelling of human behaviour [37], dance choreography [38], and so on. The transformer encoder was first applied in [17] to extract long-term temporal context information from long sequences of audio and visual data on depression. Guo et al. [18] proposed a multimodal depression detection method based on the TOpic Attentive Transformer (TOAT) by introducing a transformer pre-training model. ...
... For multimodal data fusion studies, gated recurrent unit (GRU) is used to extract features from text, speech and video data to identify depression [16]. Recently, some studies have introduced transformer encoders to depression recognition tasks [17], [18]. In our previous work [9], by embedding the spatio-temporal features in the matrix V, the attention mechanism was guided to learn the global information efficiently. ...
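Where the snippet above mentions a GRU for per-modality feature extraction, a minimal sketch of summarizing a frame-level sequence into an utterance-level vector might look as follows; all sizes are hypothetical, not those of the cited work.

```python
# Sketch: a GRU summarizing a frame-level modality sequence into one
# utterance-level feature vector (sizes are illustrative assumptions).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=40, hidden_size=64, num_layers=1, batch_first=True)

speech_frames = torch.randn(8, 200, 40)    # (batch, frames, per-frame features)
outputs, last_hidden = gru(speech_frames)  # outputs: (8, 200, 64)
utterance_feature = last_hidden[-1]        # (8, 64) summary used downstream
```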
Article
Full-text available
Depression is one of the most common mental illnesses, but few of the currently proposed deep models based on social media data take into account both temporal and spatial information in the data for the detection of depression. In this paper, we present an efficient, low-covariance multimodal integrated spatio-temporal transformer framework called DepMSTAT, which aims to detect depression using acoustic and visual features in social media data. The framework consists of four modules: a data preprocessing module, a token generation module, a Spatial-Temporal Attentional Transformer (STAT) module, and a depression classifier module. To efficiently capture spatial and temporal correlations in multimodal social media depression data, a plug-and-play STAT module is proposed. The module is capable of extracting unimodal spatio-temporal features and fusing unimodal information, playing a key role in the analysis of acoustic and visual features in social media data. Through extensive experiments on a depression database (D-Vlog), the method in this paper shows high accuracy (71.53%) in depression detection, achieving a performance that exceeds most models. This work provides a scaffold for studies based on multimodal data that assists in the detection of depression.
... EEG is a technique that records the electrical activity of the brain using metal electrodes placed on the scalp. The resulting signals are classified into different frequency bands, including Alpha (8-12 Hz), Beta (13-30 Hz), Gamma (> 30 Hz), Theta (4-8 Hz), and Delta (< 4 Hz). Alpha and beta are linked to stress or negative emotions. ...
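To make the band definitions above concrete, the sketch below splits a single EEG channel into these standard bands with Butterworth band-pass filters; the sampling rate, filter order, and the upper edge used for Gamma are assumptions, not values from the cited work.

```python
# Sketch: isolate standard EEG bands from one channel with Butterworth filters.
# Sampling rate, filter order, and the Gamma upper edge are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256                      # assumed sampling rate in Hz
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (13, 30), "gamma": (30, 45)}

def band_filter(signal, low, high, fs=FS, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)

eeg = np.random.randn(10 * FS)   # 10 s of synthetic EEG for illustration
band_signals = {name: band_filter(eeg, lo, hi) for name, (lo, hi) in BANDS.items()}
```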
Article
Full-text available
The human body responds to challenging events or demanding conditions by entering an escalated psycho-physiological state known as stress. Psychological stress can be captured through physiological signals (ECG, EEG, EDA, GSR, etc.) and behavioral signals (eye gaze, head movement, pupil diameter, etc.). Multiple stressors occurring simultaneously can cause adverse mental and physical health effects, leading to chronic health issues. Continuous monitoring of stress using wearable devices promises real-time data collection, enabling early detection and prevention of stress-related issues. The present study provides an extensive overview of stress detection techniques utilizing wearable sensors and machine and deep learning methods. Various types of machine learning algorithms, such as logistic regression, support vector machines, random forest, and deep learning models, have been applied to analyze data and detect stress. Additionally, the paper highlights the stress detection datasets and research conducted in different areas. The challenges and obstacles related to detecting stress using machine intelligence are also addressed in the paper.
... Studies [5]-[7] have been undertaken to facilitate therapists in their diagnostic process by proposing methods for the automatic assessment of depression levels through multimodal features, including visual cues, acoustic signals, body movements, etc. Liu et al. used a GRU-based model to extract temporal information from audio and visual modalities and employed average late fusion to combine them [8]. Sun et al. used the Transformer encoder to capture long-term sequences and proposed an adaptive late fusion strategy for combining results from different modalities [9]. Fang et al. proposed a multimodal fusion model with multi-level attention mechanisms to integrate audio, visual, and text modalities in order to simultaneously learn depression features [10]. ...
... Rodrigues et al. [18] proposed a multimodal fusion approach using BERT-CNN and Gated CNN representations. Liu et al. [8] employed a GRU model to capture aural and visual features and fused them using an average late fusion method while Sun et al. [9] proposed a transformer-based adaptive late fusion method. Recently, Wu et al. incorporated emotion as an additional modality, along with aural and text modalities, to enhance the robustness of depression detection [14]. ...
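Average late fusion, as referenced in the two snippets above, simply averages per-modality predictions; a weighted combination is one simple stand-in for an adaptive variant. The sketch below uses synthetic scores and arbitrary weights, not values or the exact mechanism from the cited works.

```python
# Sketch: average late fusion of per-modality depression scores, plus a
# weighted variant as a stand-in for adaptive fusion (all values synthetic).
import numpy as np

audio_pred  = np.array([0.62, 0.10, 0.85])   # per-subject scores from an audio model
visual_pred = np.array([0.55, 0.30, 0.75])   # per-subject scores from a visual model

average_fusion = (audio_pred + visual_pred) / 2

weights = np.array([0.6, 0.4])               # e.g. weights tuned on validation data
adaptive_fusion = weights[0] * audio_pred + weights[1] * visual_pred
```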
... The baseline involves mixed training on data from three different emotional stimuli, culminating in modality fusion via Intra fusion, without utilizing inter-emotional stimulus fusion, which is akin to conventional interview experiments. Inspired by the work of Sun et al. [9], we selected a subset of features that demonstrate good performance in unimodal experiments and were less likely to have side effects or slow down the training. ...
Article
Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and the results outperform state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE) and accuracy.
... For instance, the acoustic modality incorporates MFCC, eGeMaps, alongside deep features derived from VGG [33] and DenseNet [34]. Past investigations [22] by Hao et al. highlighted the discriminative power of MFCC and AU-poses in acoustic and visual modalities respectively. Hence, for streamlined and efficient analysis, we solely utilize MFCC and AU-poses features for depression detection. ...
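As a concrete illustration of the MFCC features mentioned above, a minimal extraction sketch with librosa could look as follows; the file name, sampling rate, number of coefficients, and hop length are assumptions rather than the cited configuration.

```python
# Sketch: frame-level MFCC extraction with librosa.
# The file name, sampling rate, n_mfcc, and hop_length are assumptions.
import librosa

audio, sr = librosa.load("interview.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=512)
print(mfcc.shape)   # (13, n_frames): one 13-dim MFCC vector per frame
```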
Article
Full-text available
Multimodal sentiment analysis (MSA) is an emerging field focused on interpreting complex human emotions and expressions by integrating various data types, including text, audio, and visuals. Addressing the challenges in this area, we introduce SentDep, a groundbreaking framework that merges cutting-edge fusion methods with modern deep learning structures. Designed to effectively blend the unique features of textual, acoustic, and visual data, SentDep offers a unified and potent representation of multimodal data. Our extensive tests on renowned datasets like CMU-MOSI and CMU-MOSEI demonstrate that SentDep surpasses current leading models, setting a new standard in MSA performance. We conducted thorough ablation studies and supplementary experiments to identify what drives SentDep’s success. These studies highlight the importance of the size of pre-training data, the effectiveness of various fusion techniques, and the critical role of temporal information in enhancing the model’s capabilities.
... They succeed in obtaining a concordance correlation coefficient (CCC) of 0.67 using a multimodal model of three modalities. With a CCC of 0.733, the multi-transformer model is also applied to the E-DAIC data set [19]. The inputs of the multi-transformer model are voice and facial features. ...
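The concordance correlation coefficient (CCC) quoted in these results measures agreement between predicted and reference depression scores; a minimal sketch of the standard formula applied to synthetic arrays is shown below.

```python
# Sketch: concordance correlation coefficient (CCC), the agreement metric
# reported in these results. The arrays below are synthetic illustrations.
import numpy as np

def ccc(y_true, y_pred):
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

y_true = np.array([4.0, 10.0, 15.0, 7.0])   # e.g. reference depression scores
y_pred = np.array([5.0,  9.0, 14.0, 9.0])   # e.g. model predictions
print(round(ccc(y_true, y_pred), 3))
```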
... We intend to ensure that our Window Block LSTM performs well on other data sets. Therefore, we tested our model with additional data, which is shown in Appendix A. We experimented with Bi-LSTM [18] and transformer models [19] in our data set to evaluate our model using solely facial features. Compared to baseline, experimental versions of the Bi-LSTM and Window Block LSTM models perform better, as shown in Table 6. ...
Article
Full-text available
Machine learning is used for a fast pre-diagnosis approach to prevent the effects of Major Depressive Disorder (MDD). The objective of this research is to detect depression using a set of important facial features extracted from interview video, e.g., radians, gaze at angles, action unit intensity, etc. The model is based on LSTM with an attention mechanism. It aims to combine those features using the intermediate fusion approach. Label smoothing was introduced to further improve the model's performance. Unlike other black-box models, the integrated gradient was presented as the model explanation to show important features of each patient. The experiment was conducted on 474 video samples collected at Chulalongkorn University. The data set was divided into 134 depressed and 340 non-depressed categories. The results showed that our model is the winner, with an 88.89% F1-score, 87.03% recall, 91.67% accuracy, and 91.40% precision. Moreover, the model can capture important features of depression, including head turning, no specific gaze, slow eye movement, no smiles, frowning, grumbling, and scowling, which express a lack of concentration, social disinterest, and negative feelings that are consistent with the assumptions in the depressive theories.
... These techniques have captured the attention of researchers due to their ability to learn from and make predictions based on data (Ma et al., 2016; Trotzek et al., 2017; Dinkel et al., 2017). More recently, the incorporation of multi-modal approaches (Yin et al., 2019; Rodrigues et al., 2019; Ray et al., 2019; Sun et al., 2021) has marked a significant breakthrough in the field. These approaches amalgamate multiple types of data, such as audio, text, and video features, to provide a more holistic understanding of the subject. ...
Conference Paper
Early detection of depression is crucial for the effective treatment of patients. Given the inefficiency of current depression screening methods, exploring depression recognition techniques remains a complex problem with significant implications. This study introduces a novel audio- and text-based deep neural network model for detecting depression, utilizing data from 189 subjects in the DAIC-WOZ corpus. Notably, this study presents a deep learning model (CLAM model) that incorporates CNN, LSTM, and attention layers. This model significantly outperforms other established models in both unimodal and multimodal domains, enhancing the accuracy of depression recognition. Our comparative analysis underscores the superiority of multi-task representation learning over single-task approaches. The model achieved a notable accuracy score of 0.77 and an F1-score of 0.69 on the dataset. The contribution is twofold. Firstly, it proposes and validates the robustness of a deep neural network with CNN, LSTM, and attention layers, offering innovative experimental perspectives for explainable research exploration. Secondly, it introduces an efficient multimodal fusion model for depression recognition, establishing a robust foundation for the subsequent integration of additional modalities in the future.
... World Health Organization (WHO) has stated that there are more than 300 million people affected by MDD disorders [4]. There are three factors that influence MDD disorders, namely psychological, genetic, and socio-environmental factors [5]. MDD can interfere with a person's mental health and have risks that can lead to loss of life. ...
Conference Paper
Major Depressive Disorder (MDD) is a prevalent mental disorder, affecting a significant number of individuals, with estimates reaching 300 million cases worldwide. Currently, the diagnosis of this condition relies heavily on subjective assessments based on the experience of medical professionals. Therefore, researchers have turned to deep learning models to explore the detection of depression. The objective of this review is to gather information on detecting depression based on facial expressions in videos using deep learning techniques. Overall, this research found that RNN models achieved 7.22 MAE for AVEC2014. LSTM models produced 4.83 MAE for DAIC-WOZ, while GRU models achieved an accuracy of 89.77% for DAIC-WOZ. Features like Facial Action Units (FAU), eye gaze, and landmarks show great potential and need to be further analyzed to improve results. Analysis can include applying feature engineering techniques. Aggregation methods, such as mean calculation, are recommended as effective approaches for data processing. This Systematic Literature Review found that facial expressions do have relevant patterns related to MDD.
... In recent years, transformer-based methods have played an irreplaceable role in depression detection research. However, most transformer-based methods do not take into account the spatio-temporal information in the data stream and the fusion of information between different modalities, which limits the capture of high-level semantic information in the data stream [22,23,24]. For example, although the correlation between key points in a sequence of facial landmarks is obvious, this spatial information is ignored in many studies compared to the temporal information. ...
... Cross-transformer modeling fuses single-modality streams with one another by obtaining the query (Q), key (K), and value (V) from different single-modality data. In order to capture the long-term contextual information in long-sequence audio/video of depression, a transformer encoder was applied to the depression detection task for the first time with excellent results [22]. Guo et al. [24] designed a two-branch transformer structure, the Topic-Attentive Transformer (TOAT), for depression detection. ...
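As an illustration of drawing Q from one modality and K/V from another, the sketch below applies cross-modal attention between hypothetical visual and acoustic token sequences; the feature sizes are assumptions and this is not the cited architecture.

```python
# Sketch: cross-modal attention where the query comes from the visual stream
# and the key/value come from the acoustic stream (sizes are assumptions).
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                    batch_first=True)

visual_tokens   = torch.randn(8, 60, d_model)    # Q: facial landmark/AU sequence
acoustic_tokens = torch.randn(8, 200, d_model)   # K, V: audio frame sequence

fused, attn_weights = cross_attn(query=visual_tokens,
                                 key=acoustic_tokens,
                                 value=acoustic_tokens)
# fused: (8, 60, d_model) visual tokens enriched with acoustic context
```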
... In [47], an auditory saliency mechanism was studied and applied in a Transformer to adjust the feature embeddings. Additionally, Transformer has been employed for Alzheimer's disease (AD) detection [48], [49] and depression classification [50]. For example, Ilias et al. [48] utilized a pretrained vision Transformer [27] to extract acoustic features and achieved remarkable results for dementia detection. ...
... Both [48] and [49] encouraged the use of pretrained models. In [50], a Transformer-based network was utilized to extract long-term temporal context information for depression estimation. Based on Transformer, various selfsupervised speech representation learning approaches have also been proposed, including wav2vec [51], wav2vec 2.0 [52] and HuBERT [53]. ...
Preprint
Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses. Recently, Transformer has achieved remarkable success in the natural language processing field and has demonstrated its adaptation to speech. However, previous works on Transformer in the speech field have not incorporated the properties of speech, leaving the full potential of Transformer unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to the state-of-the-art approaches.
... The existing multimodal learning works either focus on improving feature extraction for different modalities or on designing better fusion strategies. To extract features from long sequential data, classical models such as LSTM and CNN are adopted in [1,2,3], while more recent approaches employ the transformer to alleviate the forgetting problem caused by the long duration of patient interviews [4]. In regard to modality fusion, in addition to simple concatenation, an attention mechanism is applied to adjust the modality contribution [5]. ...
... If not specified, all modalities are utilized.

Model | CCC | RMSE | MAE
Baseline [1] | 0.111 | 6.37 | /
Adaptive Transformer [4] | 0.331 | / | 6.22
Bert-CNN & Gated-CNN [2] | 0.403 | 6.11 | /
VarFilter+Midimax (Ours) (A) | 0.434 | 5.44 | 4.4
Hierarchical BiLSTM [13] | 0.442 | 5.50 | /
DepressNet [5] | 0.457 | 5.36 | /
multi-DAAE+Fusion [11] | 0.528 | 4.47 | /
PV-DM [11] ...
Preprint
Full-text available
Depression is a leading cause of death worldwide, and the diagnosis of depression is nontrivial. Multimodal learning is a popular solution for the automatic diagnosis of depression, and existing works suffer from two main drawbacks: 1) the high-order interactions between different modalities cannot be well exploited; and 2) the interpretability of the models is weak. To remedy these drawbacks, we propose a multimodal multi-order factor fusion (MMFF) method. Our method can well exploit the high-order interactions between different modalities by extracting and assembling modality factors under the guidance of a shared latent proxy. We conduct extensive experiments on two recent and popular datasets, E-DAIC-WOZ and CMDC, and the results show that our method achieves significantly better performance compared with other existing approaches. Besides, by analyzing the process of factor assembly, our model can intuitively show the contribution of each factor. This helps us understand the fusion mechanism.