Fig 1 - uploaded by Afef Saidi
Block diagram of our approach

Source publication
Conference Paper
Full-text available
Depression is a serious, debilitating mental disorder affecting people of all ages worldwide, and the number of depression cases increases steadily every year. Due to the complexity of traditional techniques based on clinical diagnosis, there is a need for an automatic depression detection system. In this paper we present a no...

Context in source publication

Context 1
... overall process of our proposed approach is detailed in Fig. 1. Fig. 2 (b) shows the general workflow of our proposed approach to detect depression. This approach is compared with the CNN-based baseline model (a). ...

Similar publications

Article
Full-text available
Deep learning-based methods have recently been successfully explored in the field of hyperspectral image classification. However, training a deep learning model still requires a large number of labeled samples, which is usually impractical for hyperspectral images. In this paper, a simple but effective feature extraction method is proposed for hyperspectra...

Citations

... FVTC-CNN [35]: 0.735 / 0.656 / 0.64
Saidi [36]: 0.68 / 0.68 / 0.68
EmoAudioNet [37]: 0.732 / 0.649 / 0.653
Solieman [38]: 0.66 / 0.615 / 0.61
SIMSIAM-S [39]: 0.703 / – / –
TOAT [40]: 0.717 / 0.429 / 0.48
SpeechFormer [24]: 0.686 / 0.65 / 0.694
TFC-SpeechFormer (Ours): 0.762 / 0.701 / 0.714
... consequently enhancing the feature representation capability. This, in turn, enables the model to better understand the emotional implications within speech signals. ...
Preprint
Full-text available
Speech emotion recognition aims to automatically identify emotions in human speech, enabling machines to understand and engage in emotional communication. In recent years, Transformers have demonstrated strong adaptability and significant effectiveness in speech recognition. However, Transformer models are proficient at capturing local features but struggle to extract fine-grained details, and they often lead to computational redundancy, increasing time complexity. This paper proposes a Speech Emotion Recognition model named Temporal Fusion Convolution SpeechFormer (TFCS). The model comprises a Hybrid Convolutional Extractor (HCE) and multiple Temporal Fusion Convolution SpeechFormer Blocks (TFCSBs). HCE, consisting of an encoder and convolutional modules, enhances speech signals to capture local features and texture information, extracting frame-level features. TFCSBs utilize a newly proposed Temporal Fusion Convolution Module and a Speech Multi-Head Attention Module to capture correlations between adjacent elements in speech sequences. TFCSBs use the feature information captured by HCE to sequentially form frame, phoneme, word, and sentence structures, and integrate them to establish global and local relationships, enhancing local data capture and computational efficiency. Performance evaluation on the IEMOCAP and DAIC-WOZ datasets demonstrates the effectiveness of HCE in extracting fine-grained local speech features, with TFCS outperforming Transformer and other advanced models overall.
... Competing methods. We compare the proposed HRL with the most popular machine learning methods used for depression detection and classification in LLD research, including support vector machine (SVM) (Kim and Na, 2018; Mousavian et al., 2019; Kambeitz et al., 2017; Saidi et al., 2020), random forest (RF) (Lebedeva et al., 2017), and XGBoost (XGB) (Chen and Guestrin, 2016; Arun et al., 2018a,b; Sharma and Verbeke, 2020). (1) In the SVM method, an SVM with a radial basis function kernel and regularization parameter C = 1.0 is used for classification. ...
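The SVM configuration mentioned above (RBF kernel, regularization parameter C = 1.0) can be sketched with scikit-learn; this is a minimal illustration, not the cited study's code, using random stand-in features in place of real acoustic descriptors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 200 subjects, 40 acoustic descriptors each.
X = rng.normal(size=(200, 40))
y = rng.integers(0, 2, size=200)  # 0 = not depressed, 1 = depressed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF kernel with C = 1.0, matching the configuration described above.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # mean accuracy on the held-out split
```

On real data the features would be LLDs or related functionals rather than random noise; everything else stays the same.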
... Saidi et al. [32] advanced a hybrid detection model designed for the automated identification of depression, primarily relying on audio-based methods. The model employed a CNN architecture, with a notable departure being the integration of an SVM instead of a fully connected layer for classification. ...
Article
Full-text available
Efficient detection of depression is a challenging scenario in the field of speech signal processing. Since speech signals provide a better diagnosis of depression, a significant methodology is required for detection. However, manual examination performed by clinicians can be time-consuming and may not be feasible in complex circumstances. Diverse detection methodologies have been proposed previously, but they are found to be less accurate, time-consuming, and prone to high error rates. The proposed research article presents effective and automatic deep learning-based depression detection using speech signal data. The steps involved in depression prediction are data acquisition, pre-processing, feature extraction, feature selection and classification. The initial step in depression detection is data acquisition, which aims at collecting speech signals from the Distress Analysis Interview Corpus (DAIC-WOZ) and Sonde Health-free speech (SH2-FS) datasets. The collected data are pre-processed through MS_DWT (Multi-stage Discrete Wavelet Transform) to offer noise-free signals and improved signal quality. The relevant features required for processing the speech signal are extracted through the Hilbert-Huang (H-H) transform, linear prediction cepstral coefficients (LPCC), fundamental frequency, formants, speaking rate and Mel frequency cepstral coefficients (MFCC). From the extracted features, the ideal features required for enhancing detection accuracy are selected using the Price Auction optimization algorithm (PAOA). Finally, the depression and non-depression states are classified using deep convolutional attention-cascaded two-directional long short-term memory (DAttn_Conv 2D LSTM) with a softmax classifier. The overall accuracy obtained in classifying the depressed and non-depressed classes is 97.82% and 98.91%, respectively.
... Babu et al. (2022) used the CNN technique to classify four rice diseases (Adedoyin et al., 2022). Saidi et al. (2020) employed a combined CNN-SVM approach to detect depression. The CNN-SVM classifier produced a precision rate of 68% using the Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ) dataset. ...
Article
Full-text available
Introduction Paddy leaf diseases have a catastrophic influence on the quality and quantity of paddy grain production. The detection and identification of the intensity of various paddy infections are critical for high-quality crop production. Methods In this paper, infections in paddy leaves are considered for the identification of illness severity. The dataset contains both primary and secondary data. The four online repositories used for secondary data resources are Mendeley, GitHub, Kaggle and UCI. The size of the dataset is 4,068 images. The dataset is first pre-processed using ImageDataGenerator. Then, a generative adversarial network (GAN) is used to increase the dataset size exponentially. The disease severity calculation for the infected leaf is performed using a number of segmentation methods. To determine paddy infection, a deep learning-based hybrid approach is proposed that combines the capabilities of a convolutional neural network (CNN) and support vector machine (SVM). The severity levels are determined with the assistance of a domain expert. Four degrees of disease severity (mild, moderate, severe, and profound) are considered. Results Three infections are considered in the categorization of paddy leaf diseases: bacterial blight, blast, and leaf smut. The model predicted the paddy disease type and intensity with a 98.43% correctness rate. The loss rate is 41.25%. Discussion The findings show that the proposed method is reliable and effective for identifying the four levels of severity of bacterial blight, blast, and leaf smut infections in paddy crops. The proposed model performed better than the existing CNN and SVM classification models.
... In the existing literature, Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) algorithms were the most commonly used methods due to their ability to generate more accurate predictions on existing mental health datasets [16,17,18,19]. For instance, SVM could classify depressive behaviors on audio-based datasets [20]. KNN was utilized to analyze electroencephalogram (EEG) signals to identify shifts in depressive behaviors by grouping feature spaces into clusters [21]. ...
Conference Paper
Full-text available
Mental health issues such as depression are unfortunately on the rise. Analyzing signals generated from sensors using machine learning (ML) techniques has shown promise as an objective, holistic, and cost-effective tool for assessing human mental conditions. However, many sensors often expose individuals' Personally Identifiable Information (PII). In this work-in-progress study, we explore the performance of ML-based models for depression detection using only non-PII-releasing sensors. We compared the performance of prevailing ML models on the StudentLife dataset, which contains sensor signal data from smartphones for 48 students across 65 days. Our findings suggest that Decision Tree, Gradient Boosting, and Logistic Regression show higher predictive power even with limited non-PII-releasing sensor data. Additionally, we found that certain combinations of non-PII-releasing sensors, such as surrounding conversation and phone lock frequency, help ML classifiers attain almost perfect performance across accuracy, recall, precision, and F1. This study has implications for a privacy-preserving approach to detecting depression.
... A mean accuracy of 87.55% was reported using random tree models with 100 trees. Additionally, Saidi et al. employed a hybrid model combining a CNN and an SVM to detect depression [155]. Features extracted from DAIC-WOZ speech samples were used for depression analysis [156]. ...
Article
Full-text available
Speech carries essential information about the speaker's physiology and possible pathophysiological conditions. Bio-acoustic voice qualities show promising value for characterizing mood disorders. Depression alters several bio-acoustic features of speech, and by measuring and analyzing those, conventional diagnostic tools could be enhanced and clinical support improved. Here, we review the use of speech as an objective biomarker of depression. We briefly review the speech production process and the acoustic theory of voice production, and explore the most commonly quantified bio-acoustic characteristics and their correlation with depression. We highlight the effect of depression on speech production and bio-acoustic speech characteristics and conclude with a summary of speech-based studies that suggest that depression diagnostics could be augmented by speech. Advances in computerized speech processing allow for an objective analysis of speech and the classification of speakers' conditions through machine learning. Encouraging early results suggest a future role for depression screening in the clinic.
... They use deep spectral features extracted from pre-trained Visual Geometry Group (VGG-16) networks for speech processing, a Gated Convolutional Neural Network (GCNN) consisting of LSTM layers, and BERT for text embedding with CNNs consisting of LSTM layers. In addition, Saidi et al. [22] propose a novel method for the automated detection of depression using an audio-based hybrid model. The model uses a CNN for automatic feature extraction and an SVM for classification. ...
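The CNN-plus-SVM pattern recurring in these citations (a CNN as automatic feature extractor, with an SVM replacing the fully connected classification layer) can be sketched as follows. This is an illustrative sketch, not the code of the cited work, assuming PyTorch and scikit-learn, with an untrained toy network and random stand-in inputs:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

torch.manual_seed(0)

# Tiny 1-D CNN over spectrogram-like input; its pooled conv output serves as
# the learned feature vector, and an SVM takes the place of the FC layer.
class FeatureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(4),  # 16 channels x 4 bins -> 64-dim features
        )

    def forward(self, x):
        return self.net(x).flatten(1)

cnn = FeatureCNN().eval()
X = torch.randn(64, 1, 128)  # hypothetical stand-in audio frames
y = np.random.default_rng(0).integers(0, 2, size=64)

with torch.no_grad():
    feats = cnn(X).numpy()   # CNN acts as the automatic feature extractor

svm = SVC(kernel="rbf").fit(feats, y)  # SVM classifier replaces the FC layer
preds = svm.predict(feats)
```

In practice the CNN would first be trained end-to-end (with a temporary softmax head) before its features are handed to the SVM; the sketch only shows the wiring of the two stages.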
Article
Full-text available
This paper introduces the Are u Depressed (AuD) model, which aims to detect depressive emotional intensity and classify detailed depressive symptoms expressed in user utterances. The study includes the creation of a BWS dataset using a tool for the Best-Worst Scaling annotation task and a DSM-5 dataset containing nine types of depression annotations based on major depressive disorder (MDD) episodes in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). The proposed model employs the DistilBERT model for both tasks and demonstrates superior performance compared to other machine learning and deep learning models. We suggest using our model for real-time depressive emotion detection tasks that demand speed and accuracy. Overall, the AuD model significantly advances the accurate detection of depressive emotions in user utterances.
... Hence, the method is referred to as the "voted t-test," as the test was conducted on multiple groups sampled from the original population. This database provides recordings in the form of audio, video, and psychiatric responses in text form [24]. To evaluate the severity of the patient's depression, interviews were conducted using the PHQ-8. ...
Article
Full-text available
Major Depressive Disorder (MDD) has been known as one of the most prevalent mental disorders, whose symptoms can be observed through changes in facial behavior. Previous studies had attempted to build Machine Learning (ML) models to assess depression severity using such features, but few have utilized these models to determine key facial behaviors for MDD. In this study, we used video data to assess the severity of MDD and determine important features based on three approaches (XGBoost, Spearman's correlation, and the t-test). In addition, the Facial Action Coding System (FACS) framework allows visual data such as changes in facial behavior to be modeled as time series data. The results show that the XGBoost model obtained the best results when trained using features selected through the t-test statistical method, with 5.387 MAE, 6.266 RMSE, and 0.042 R². The majority of the important features consist of Action Units (AU) and 3D features around the regions of the left eye, right cheek, and lip area. Across the three approaches, the majority of the important features discovered are the first derivatives of the 3D facial landmark coordinates of the cheeks, eyes, and lips, especially along the z-axis. However, the variables used in this research are limited to first derivatives, which means that wider variations of facial behavior data may be studied further so that Computer-Aided Diagnosis (CAD) systems for mental disorders may be realized in the future.
... The number of non-depressed subjects was about four times larger than that of depressed ones in both the training and development partitions. The articles [3], [21], [20], [22], [13], [23], and [5] used the DAIC-WOZ dataset. ...
... The spectrogram-based approach of [22] overcame the issue of class imbalance by cropping each participant's spectrogram into 4-second slices. Then, participants were randomly sampled in equal proportion from each class (depressed and not depressed). ...
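The slicing-and-balancing strategy just described can be sketched with NumPy. The 100-frames-per-second resolution, participant counts, and spectrogram sizes below are illustrative assumptions, not values taken from [22]:

```python
import numpy as np

FRAMES_PER_SEC = 100          # assumed spectrogram frame rate
SLICE_SEC = 4                 # 4-second crops, as in the cited approach
slice_len = FRAMES_PER_SEC * SLICE_SEC

def crop_slices(spec):
    """Crop a (mel_bins, frames) spectrogram into non-overlapping 4 s slices."""
    n = spec.shape[1] // slice_len
    return [spec[:, i * slice_len:(i + 1) * slice_len] for i in range(n)]

rng = np.random.default_rng(0)
# Hypothetical imbalanced cohort: 3 depressed, 12 non-depressed participants,
# each with a variable-length 80-bin spectrogram.
specs = {f"p{i}": rng.normal(size=(80, int(rng.integers(1200, 3000))))
         for i in range(15)}
labels = {p: int(i < 3) for i, p in enumerate(specs)}

slices = {p: crop_slices(s) for p, s in specs.items()}

# Balance classes: keep all minority-class participants and randomly sample
# an equal number from the majority class.
dep = [p for p in slices if labels[p] == 1]
ndep = [p for p in slices if labels[p] == 0]
k = min(len(dep), len(ndep))
balanced = dep[:k] + list(rng.choice(ndep, size=k, replace=False))
```

Each selected participant then contributes all of their fixed-length slices to training, so the model sees equally many speakers from each class.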
Article
Full-text available
Depression is a prevailing mental disturbance affecting an individual's thinking and mental development. There has been much research demonstrating effective automated prediction and detection of depression. Many of the datasets used suffer from class imbalance, where samples of a dominant class outnumber the minority class that is to be detected. This review paper uses the PRISMA review methodology to enlist the different class-imbalance handling techniques used in depression prediction and detection research. The articles were taken from information technology databases. The research gap found is that undersampling methods for predicting and detecting depression were few, and regression modelling could be considered for future research. The results also revealed that the most common data-level technique is SMOTE as a single method, and the most common ensemble approach combines SMOTE, oversampling, and undersampling techniques. The model level consisted of various algorithms that can be used to tackle the class imbalance problem.
... SVM and RF were used for depression classification not only on low-level descriptors (LLDs) and related functionals in Tasnim and Stroulia (2019) but also on i-vectors in Xing et al. (2022). On the other hand, the results of Saidi et al. (2020), comparing a baseline CNN model with a model combining CNN and SVM, showed that the SVM classifier improved classification accuracy. An exploratory study (Espinola et al., 2021), which compared experimental results of MLP, Logistic Regression (LR), RF, Bayes Network, Naïve Bayes, and SVMs with different kernels, concluded that RF provided the highest accuracy among all classifiers for MDD detection. ...
Article
Full-text available
Introduction As a biomarker of depression, the speech signal has attracted the interest of many researchers due to its ease of collection and non-invasiveness. However, subjects' speech variation under different scenes and emotional stimuli, the insufficient amount of depression speech data for deep learning, and the variable length of speech frame-level features all have an impact on recognition performance. Methods To address the above problems, this study proposes a multi-task ensemble learning method based on speaker embeddings for depression classification. First, we extract the Mel Frequency Cepstral Coefficients (MFCC), the Perceptual Linear Predictive Coefficients (PLP), and the Filter Bank features (FBANK) from the out-domain dataset (CN-Celeb) and train the Resnet x-vector extractor, the Time Delay Neural Network (TDNN) x-vector extractor, and the i-vector extractor. Then, we extract the corresponding fixed-length speaker embeddings from the depression speech database of the Gansu Provincial Key Laboratory of Wearable Computing. Support Vector Machine (SVM) and Random Forest (RF) classifiers are used to obtain the classification results of the speaker embeddings in nine speech tasks. To make full use of the information from speech tasks with different scenes and emotions, we aggregate the classification results of the nine tasks into new features and then obtain the final classification results using a Multilayer Perceptron (MLP). To take advantage of the complementary effects of different features, Resnet x-vectors based on different acoustic features are fused in the ensemble learning method.
Results Experimental results demonstrate that (1) MFCC-based Resnet x-vectors perform best among the nine speaker embeddings for depression detection; (2) interview speech works better than picture-description speech, and the neutral stimulus is the best among the three emotional valences in the depression recognition task; (3) our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively identify depressed patients; (4) in all cases, the combination of MFCC-based and PLP-based Resnet x-vectors in our ensemble learning method achieves the best results, outperforming other literature studies using the same depression speech database. Discussion Our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively fuse the depression-related information of different stimuli, which provides a new approach for depression detection. A limitation of this method is that the speaker embedding extractors were pre-trained on an out-domain dataset. We will consider using an augmented in-domain dataset for pre-training to further improve depression recognition performance.
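The two-stage aggregation described in the Methods (per-task classifiers whose outputs become new features for an MLP) can be sketched with scikit-learn. This is a simplified illustration with random stand-in embeddings, not the authors' pipeline; for brevity it scores on the training data, whereas a real ensemble would use cross-validated per-task predictions to avoid leakage into the meta-learner:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N_TASKS, N_SUBJ, DIM = 9, 120, 32   # 9 speech tasks; subject count and
                                    # embedding size are hypothetical

# Stand-in speaker embeddings for each task, plus binary depression labels.
X_tasks = [rng.normal(size=(N_SUBJ, DIM)) for _ in range(N_TASKS)]
y = rng.integers(0, 2, size=N_SUBJ)

# Stage 1: one SVM per task; its decision scores become meta-features.
meta = np.column_stack([
    SVC(kernel="rbf").fit(X, y).decision_function(X) for X in X_tasks
])  # shape: (N_SUBJ, N_TASKS)

# Stage 2: an MLP aggregates the nine per-task outputs into a final decision.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(meta, y)
final = mlp.predict(meta)
```

Swapping RF in for SVM at stage 1, or fusing embeddings from different acoustic features by concatenating their meta-feature columns, follows the same pattern.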