A spectrogram example and its visualization results using gradient-based localization to predict the grade score in the second and third residual blocks of EfficientNet. (A) Spectrogram with heatmap visualization for a patient with a high grade score. (B) A patient with a normal grade.

Source publication
Article
Full-text available
Despite the lack of findings on laryngeal endoscopy, it is common for patients to experience voice problems after thyroid surgery. This study aimed to predict the recovery of the patient's voice 3 months after surgery from preoperative and postoperative voice spectrograms. We retrospectively collected voice recordings and GRBAS scores from 114 patients undergoing sur...
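A citing work below notes that the study's training combined EfficientNet with an LSTM over the pre- and postoperative spectrograms. As a rough, non-authoritative sketch of one such pairing (the shared backbone, all shapes, and the regression head are assumptions, not the study's published architecture):

```python
# Hedged sketch: a shared EfficientNet backbone encodes the pre- and
# post-operative spectrograms, and an LSTM fuses the two time points to
# regress a GRBAS-style grade. Shapes and the head are assumptions.
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))

inputs = tf.keras.Input(shape=(2, 224, 224, 3))            # (pre-op, post-op)
feats = tf.keras.layers.TimeDistributed(backbone)(inputs)  # (batch, 2, 1280)
fused = tf.keras.layers.LSTM(64)(feats)                    # fuse time points
grade = tf.keras.layers.Dense(1)(fused)                    # predicted grade

model = tf.keras.Model(inputs, grade)
model.compile(optimizer="adam", loss="mse")
```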

Contexts in source publication

Context 1
... visualize the important features of the scores in the spectrograms, an activation heatmap was constructed using the second and third residual blocks of the EfficientNet model (Figure 5). The voice spectrogram of a patient with a high grade is shown in Figure 5A. ...
Context 2
... visualize the important features of the scores in the spectrograms, an activation heatmap was constructed using the second and third residual blocks of the EfficientNet model (Figure 5). The voice spectrogram of a patient with a high grade is shown in Figure 5A. This spectrogram is from one of the patients with very poor voice quality. ...
Context 3
... the heatmap visualization through Grad-CAM, the highlighted part shows an imaging feature important for predicting the grade score. The spectrogram of a normal-grade voice is shown in Figure 5B. In this patient, the amplitude spreads widely over various frequencies. ...
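For readers who want to reproduce this kind of visualization, below is a minimal Grad-CAM sketch over an intermediate block of a pretrained EfficientNet, assuming a Keras EfficientNetB0 with ImageNet weights, a spectrogram saved as an image, and the layer name block3a_expand_activation standing in for a residual-block output; the study's exact model, blocks, and preprocessing are not reproduced here.

```python
# Minimal Grad-CAM sketch over an intermediate EfficientNet block.
# Assumptions (not from the paper): Keras EfficientNetB0 with ImageNet
# weights, input file "spectrogram.png", layer "block3a_expand_activation".
import numpy as np
import tensorflow as tf

model = tf.keras.applications.EfficientNetB0(weights="imagenet")
layer = model.get_layer("block3a_expand_activation")
grad_model = tf.keras.Model(model.input, [layer.output, model.output])

img = tf.keras.utils.load_img("spectrogram.png", target_size=(224, 224))
x = tf.keras.applications.efficientnet.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

with tf.GradientTape() as tape:
    activations, preds = grad_model(x)
    class_idx = int(tf.argmax(preds[0]))  # top predicted class (eager mode)
    score = preds[:, class_idx]

grads = tape.gradient(score, activations)     # d(score) / d(activations)
weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pooled grads
cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", activations, weights))[0]
cam = cam / (tf.reduce_max(cam) + 1e-8)       # [0, 1] heatmap to overlay
```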

Citations

... That model was evaluated with an accuracy of 86%, and the model's decision to classify normal and abnormal heartbeat sounds in the time domain was confirmed using Grad-CAM on the results. Lee [18] conducted a study to predict the recovery of patients who underwent thyroid cancer surgery by converting their speech before and after surgery into spectrograms. The training involved the use of EfficientNet and LSTM, with patient and vocal data provided by DIRAMS. ...
Article
Full-text available
This study aims to establish greater reliability than conventional speech emotion recognition (SER) studies. This is achieved through preprocessing techniques that reduce uncertainty, models that combine the structural features of each component model, and the application of various explanatory techniques. Interpretation can be made more accurate by reducing uncertain training data, applying data from different environments, and applying techniques that explain the reasoning behind the results. We designed a generalized model using three different datasets, and each speech sample was converted into a spectrogram image through STFT preprocessing. The spectrogram was divided along the time domain with overlap to match the input size of the model. Each divided section is expressed as a Gaussian distribution, and the quality of the data is assessed by the correlation coefficient between distributions. As a result, the scale of the data is reduced and uncertainty is minimized. VGGish and YAMNet are the most representative pretrained deep learning networks frequently used in speech processing. In speech signal processing, it is frequently advantageous to use these pretrained models synergistically rather than exclusively, resulting in the construction of ensemble deep networks. Finally, various explainable techniques (Grad-CAM, LIME, occlusion sensitivity) are used to analyze the classification results. The model exhibits adaptability to voices in various environments, yielding a classification accuracy of 87%, surpassing that of the individual models. Additionally, the outputs are confirmed by an explainable model to extract essential emotional regions, which are converted into audio files for auditory analysis using Grad-CAM in the time domain. Through this study, we reduce the uncertainty of the activation areas generated by Grad-CAM by applying the interpretability insights of previous studies along with effective preprocessing and fusion models, and the results can be analyzed from more diverse perspectives through the other explainable techniques.
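As a rough illustration of the preprocessing described in this abstract, the sketch below converts a waveform into an STFT spectrogram, slices it into overlapping time windows, summarizes each slice as a Gaussian (mean, standard deviation), and scores neighboring slices with a correlation coefficient. The file name and every parameter value are illustrative assumptions, not the authors' settings.

```python
# Illustrative STFT preprocessing: spectrogram -> overlapping slices ->
# per-slice Gaussian summary -> pairwise correlation between slices.
# Parameter values (n_fft, hop, slice width, overlap) are assumptions.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)            # hypothetical file
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=128))
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

width, step = 64, 32                                    # 50% overlap
slices = [spec_db[:, i:i + width]
          for i in range(0, spec_db.shape[1] - width + 1, step)]

# Summarize each slice as a Gaussian over its values, then compare
# neighboring slices via correlation of their mean spectra.
stats = [(s.mean(), s.std()) for s in slices]
corrs = [np.corrcoef(a.mean(axis=1), b.mean(axis=1))[0, 1]
         for a, b in zip(slices, slices[1:])]
```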
... The results of the correlation analysis for the predicted grade, breathiness, and asthenia scores were 0.741, 0.766, and 0.433, respectively. This research showed the potential for predicting vocal recuperation three months post-surgery via spectrogram analysis [16]. Patrick Schlegel et al. conducted a study to identify clinical parameters that are sensitive to functional voice disorders using boosted decision stumps. ...
Article
Full-text available
Examining the relationship between prognostic factors and the effectiveness of voice therapy is a crucial step in developing personalized treatment strategies for individuals with voice disorders. This study recommends using a multilayer perceptron (MLP) model to comprehensively analyze the prognostic factors, with various parameters including personal habits and acoustic parameters, that can influence the before-and-after effectiveness of voice therapy in individuals with voice disorders. Various methods, including the assessment of personal characteristics, acoustic analysis, statistical analysis, binomial logistic regression analysis, and the MLP, are implemented in this experiment. Accuracies of 87.5% and 85.71% are achieved with the optimal combination of input parameters for female and male voices, respectively, through the MLP model, validating the selection of input parameters when building the model. Good prognostic indicators of the clinical effectiveness of voice therapy in voice disorders are jitter (post-treatment) for females and MPT (pre-treatment) for males. The results are expected to provide a foundation for modeling research utilizing artificial intelligence in voice therapy for voice disorders. As follow-up work, research that utilizes big data to analyze the optimal parameters for predicting the clinical effectiveness of voice therapy will be necessary.
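To make the modeling step concrete, here is a minimal sketch of an MLP prognosis classifier on tabular acoustic features, assuming synthetic data; the feature names follow the abstract (jitter, MPT), but the data, architecture, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of an MLP prognosis classifier on acoustic features.
# Feature names echo the abstract; data and hyperparameters are assumed.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # columns: jitter, shimmer, MPT, age
y = rng.integers(0, 2, size=200)     # 1 = therapy effective, 0 = not

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000,
                                  random_state=0))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```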
... Its accuracy surpasses that of the one-dimensional convolutional neural network (1D CNN), as 2D CNN models can extract finer features from the spectrogram [7]. Kevin et al. aimed to build a more accurate sound classification model and proposed a two-stream neural network architecture that includes the EfficientNet-based model [8]. Lee et al. utilized preoperative and postoperative voice spectrograms as features to predict three-month postoperative vocal recovery [9]. This model could be widely applicable for transfer learning in sound classification. ...
Article
Full-text available
Wearable assistant devices play an important role in daily life for people with disabilities. Those who have hearing impairments may face dangers while walking or driving on the road. The major danger is their inability to hear warning sounds from cars or ambulances. Thus, the aim of this study is to develop a wearable assistant device with edge computing, allowing the hearing impaired to recognize the warning sounds from vehicles on the road. An EfficientNet-based, fuzzy rank-based ensemble model was proposed to classify seven audio sounds, and it was embedded in an Arduino Nano 33 BLE Sense development board. The audio files were obtained from the CREMA-D dataset and the Large-Scale Audio dataset of emergency vehicle sirens on the road, with a total number of 8756 files. The seven audio sounds included four vocalizations and three sirens. The audio signal was converted into a spectrogram by using the short-time Fourier transform for feature extraction. When one of the three sirens was detected, the wearable assistant device presented alarms by vibrating and displaying messages on the OLED panel. The performances of the EfficientNet-based, fuzzy rank-based ensemble model in offline computing achieved an accuracy of 97.1%, precision of 97.79%, sensitivity of 96.8%, and specificity of 97.04%. In edge computing, the results comprised an accuracy of 95.2%, precision of 93.2%, sensitivity of 95.3%, and specificity of 95.1%. Thus, the proposed wearable assistant device has the potential benefit of helping the hearing impaired to avoid traffic accidents.
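To give a sense of how a fuzzy rank-based ensemble can fuse per-model confidences, here is a small sketch of one common formulation: two nonlinear rank functions map each model's softmax scores to fuzzy ranks (lower means more confident), ranks are accumulated across models, and the class with the lowest fused rank wins. The exact rank functions and fusion rule of this paper are not reproduced here.

```python
# Sketch of fuzzy rank-based fusion of per-model softmax confidences.
# Both rank functions decrease toward 0 as confidence approaches 1,
# so the lowest accumulated rank identifies the ensemble's class.
import numpy as np

def fuse(p):
    """p: (n_models, n_classes) softmax confidences in [0, 1]."""
    q = ((p - 1.0) ** 2) / 2.0
    r1 = 1.0 - np.exp(-q)            # fuzzy rank function 1
    r2 = np.tanh(q)                  # fuzzy rank function 2
    fused = (r1 * r2).sum(axis=0)    # accumulate ranks over models
    return int(np.argmin(fused))     # best (lowest) fused rank wins

# Three models, four classes: two of three models strongly favor class 0.
p = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.60, 0.25, 0.10, 0.05],
              [0.20, 0.55, 0.15, 0.10]])
print(fuse(p))  # -> 0
```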
... Kevin et al. aimed to build a more accurate sound classification model and proposed a two-stream neural network architecture that includes the EfficientNet model [6]. Lee et al. utilized preoperative and postoperative voice spectrograms as features to predict three-month postoperative vocal recovery [7]. This model could be widely applied for transfer learning in sound classification. ...
Preprint
Full-text available
Wearable assistant devices play an important role in daily life for people with disabilities. Those who are hearing impaired may face dangers while walking or driving on the road. The major danger is their inability to hear warning sounds from cars or ambulances. Thus, the goal of this study is to develop a wearable assistant device for the hearing impaired to recognize emergency vehicle sirens on the road using edge computing. An EfficientNet-based fuzzy rank-based ensemble model was proposed to classify seven audio sounds, including human vocalizations and emergency vehicle sirens. This model was embedded in an Arduino Nano 33 BLE Sense development board. The audio files were obtained from the CREMA-D dataset and the Large-Scale Audio dataset of emergency vehicle sirens on the road, with a total of 8756 files. The seven audio sounds were neutral, angry, fearful, and happy vocalizations, car horn sounds, siren sounds, and ambulance siren sounds. The audio signal was converted into a spectrogram by the short-time Fourier transform for feature extraction. When a car horn, siren, or ambulance siren was detected, the wearable assistant device presented alarms through vibration and messages on the OLED panel. The performance of the EfficientNet-based fuzzy rank-based ensemble model in offline computing achieved an accuracy of 97.1%, precision of 97.79%, sensitivity of 96.8%, and specificity of 97.04%. In edge computing, the results were an accuracy of 95.2%, precision of 93.2%, sensitivity of 95.3%, and specificity of 95.1%. Thus, the proposed wearable assistant device has the potential benefit of helping the hearing impaired avoid traffic accidents.
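Since edge deployment is the distinctive part here, below is a hedged sketch of the inference loop one might prototype on a desktop before porting to a microcontroller: a (hypothetical) quantized TFLite classifier is run on a spectrogram and an alert is raised for the siren classes. The model path, tensor shapes, and class indices are all assumptions.

```python
# Hedged sketch of edge-style inference with a (hypothetical) quantized
# TFLite audio classifier. Model file and class indices are assumptions.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="siren_classifier.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

spec = np.zeros(inp["shape"], dtype=inp["dtype"])   # stand-in spectrogram
interpreter.set_tensor(inp["index"], spec)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]

ALERT_CLASSES = {4, 5, 6}  # assumed indices: horn, siren, ambulance siren
if int(np.argmax(probs)) in ALERT_CLASSES:
    print("ALERT: warning sound detected -> vibrate + OLED message")
```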
Article
Purpose of review: The purpose of this review is to present recent advances and limitations in machine learning applied to the evaluation of speech, voice, and swallowing in head and neck cancer. Recent findings: Novel machine learning models incorporating diverse data modalities with improved discriminatory capabilities have been developed for predicting toxicities following head and neck cancer therapy, including dysphagia, dysphonia, xerostomia, and weight loss, as well as for guiding treatment planning. Machine learning has been applied to the care of posttreatment voice and swallowing dysfunction by offering objective and standardized assessments and by aiding innovative technologies for functional restoration. Voice and speech are also being utilized in machine learning algorithms to screen for laryngeal cancer. Summary: Machine learning has the potential to help optimize, assess, predict, and rehabilitate voice and swallowing function in head and neck cancer patients, as well as aid in cancer screening. However, existing studies are limited by the lack of sufficient external validation and generalizability, insufficient transparency and reproducibility, and no clearly superior predictive modeling strategy. Algorithms and applications will need to be trained on large multi-institutional data sets, incorporate sociodemographic data to reduce bias, and achieve validation through clinical trials for optimal performance and utility.
Article
Background: Thyroidectomy may be performed for clinical indications that include malignancy, benign nodules or cysts, suspicious findings on fine-needle aspiration (FNA) biopsy, dyspnea from airway compression, or dysphagia from cervical esophageal compression. The reported incidence of vocal cord palsy (VCP) caused by thyroid surgery ranges from 3.4% to 7.2% for temporary and 0.2% to 0.9% for permanent vocal fold palsy, a serious complication of thyroidectomy that is worrisome for patients. Objective: This study therefore aims to identify, using machine learning methods, patients at risk of developing vocal cord palsy before thyroidectomy. In this way, the possibility of developing palsy can be reduced by applying appropriate surgical techniques to individuals in the high-risk group. Method: To this end, data from 1039 thyroidectomy patients treated between 2015 and 2018 at the department of general surgery of Karadeniz Technical University Medical Faculty Farabi Hospital were used. The clinical risk prediction model was developed using the proposed sampling and random forest classification method on the dataset. Conclusion: As a result, a quite satisfactory novel prediction model with 100% accuracy was developed for VCP before thyroidectomy. Using this clinical risk prediction model, physicians can be helped to identify patients at high risk of developing postoperative palsy before the operation.
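The abstract names the recipe (a sampling method plus random forest classification) without the details; the sketch below shows the general pattern on synthetic data: oversample the rare positive class in the training split, then fit a random forest. The features, class prevalence, and sampling scheme are assumptions, not the authors'.

```python
# Sketch of the general recipe: balance a rare-outcome dataset by
# resampling, then fit a random forest risk classifier. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1039, 10))              # stand-in clinical features
y = (rng.random(1039) < 0.05).astype(int)    # ~5% develop VCP (assumed)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training split only.
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
up = resample(minority, n_samples=len(majority), replace=True, random_state=0)
idx = np.concatenate([majority, up])

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_tr[idx], y_tr[idx])
print("held-out accuracy:", clf.score(X_te, y_te))
```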
Article
Diagnosis using voice is non-invasive and can be implemented through various voice recording devices; therefore, it can be used as a screening or diagnostic-assistant tool for laryngeal voice disease to help clinicians. The development of artificial intelligence algorithms such as machine learning, led by the latest deep learning technology, began with binary classification that distinguishes normal from pathological voices and has since contributed to improving the accuracy of multi-class classification of various types of pathological voice. However, no conclusions that can be applied in the clinical field have yet been reached. Most studies on pathological voice classification have used the sustained short vowel /ah/, which is relatively easier to work with than continuous or running speech. However, continuous speech has the potential to yield more accurate results, as additional information can be obtained from the change in the voice signal over time. In this review, terms related to artificial intelligence research are explained and the latest trends in machine learning and deep learning algorithms are reviewed; furthermore, the latest research results and limitations are introduced to provide future directions for researchers.