TABLE 1 - uploaded by Bruce Gerratt
Measures of Intrarater (test-retest) reliability. 

Source publication
Article
The reliability of listeners' ratings of voice quality is a central issue in voice research because of the clinical primacy of such ratings and because they are the standard against which other measures are evaluated. However, an extensive literature review indicates that both intrarater and interrater reliability fluctuate greatly from study to st...

Context in source publication

Context 1
... experimental session lasted approximately 15 minutes. Table 1 lists values of the most common measures of test-retest agreement and reliability, calculated from our data. On average, ratings of these voices did not vary much from first to second rating. ...
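The excerpt above does not say which test-retest measures Table 1 reports, so the following is only a minimal sketch of measures commonly used for this purpose: Pearson's r (reliability), the proportion of exactly repeated ratings, the proportion repeated within a tolerance, and the mean absolute difference (agreement). The function name `retest_summary`, the `tolerance` parameter, and the dictionary keys are hypothetical and are not taken from the source publication.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined if either rating list has zero variance (division by zero).
    return sxy / sqrt(sxx * syy)


def retest_summary(first, second, tolerance=1):
    """Summarise how closely one listener's second ratings repeat the first.

    `first` and `second` hold the same listener's ratings of the same
    voices, in the same order. `tolerance` is an assumed number of scale
    points counted as "close enough" agreement.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    diffs = [abs(a - b) for a, b in zip(first, second)]
    n = len(diffs)
    return {
        "pearson_r": pearson_r(first, second),
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_tolerance": sum(d <= tolerance for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }
```

Correlation and agreement answer different questions: a listener who shifts every rating up by one scale point on the second pass still has r = 1 but an exact agreement of 0, which is why both kinds of measure are usually reported together.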

Similar publications

Article
Aperiodicity of speech alters voice quality. The current study investigated the relationship between vowel aperiodicity and human auditory cortical N1m and sustained field (SF) responses with magnetoencephalography. Behavioral estimates of vocal roughness perception were also collected. Stimulus aperiodicity was experimentally varied by increasing...
Article
Purpose Identify the terms mentioned by the general population for healthy, rough and breathy vocal quality. Methods A test was carried out with 50 participants, in person, without academic or professional ties with Speech Therapy. The task was to hear three voices and define them freely. The first voice presented was predominantly breathy; the se...
Article
Acoustic analysis of the speech signal enables automatic detection and classification of voice disorders along with their severity. This automatic assessment helps the clinician in the initial diagnosis of a pathological larynx in a non-intrusive way. Voice pathologies damage the vocal cords and consequently alter the dynamics (fluctuation speed) of v...
Article
Purpose: to verify if teachers with less vocal use due to reduced workload have fewer complaints of vocal disorders and better environmental and organizational working conditions. Methods: 46 teachers of both genders, with a mean age of 39.5 years old, and 15 years of career length participated in this study. The individuals were divided into gr...
Article
This study explores how stereotypical preconceptions about gender and conversational behaviour may affect observers’ perceptions of a speaker’s performance. Using updated matched-guise techniques, we digitally manipulated the same recording of a conversation to alter the voice quality of “Speaker A” to sound “male” or “female.” Respondents’ percept...

Citations

... Raters will be blinded to the stage of treatment and will not have treated the participant. Raters will receive a brief refresher training programme in the use of CAPE-V to improve inter-rater reliability and will have external anchor voices provided to overcome the reduced intrarater and inter-rater agreement associated with the increased freedom of judgement [52]. The mean rating of the raters for each recording will be used as the data point for individual patients. ...
Article
Background Management of benign vocal fold lesions (BVFLs) is variable with individuals receiving surgery, voice therapy, or a combination of these approaches. Some evidence suggests that the best outcomes may be achieved when patients are offered pre- and post-operative voice therapy in addition to phonosurgery, but what constitutes pre- and post-operative voice therapy is poorly described. The pre- and post-operative voice therapy (PAPOV) intervention has been developed and described according to the TIDieR checklist and Rehabilitation Treatment Specification System (RTSS) for voice. The PAPOV intervention is delivered by specialist speech and language therapists trained in the intervention and comprises 7 essential and 4 additional components, delivered in voice therapy sessions with patients who are having surgery on their vocal folds for removal of BVFLs. Study design Non-randomised, multicentre feasibility trial with embedded process evaluation. Method Forty patients from two sites who are due to undergo phonosurgery will be recruited to receive the PAPOV intervention. Measures of feasibility, including recruitment, retention, and adherence, will be assessed. The feasibility of gathering clinical and cost effectiveness data will be measured pre-treatment, then at 3 and 6 months post-operatively. An embedded process evaluation will be undertaken to explain feasibility findings. Discussion This study will assess the feasibility of delivering a described voice therapy intervention protocol to patients who are undergoing surgery for removal of BVFLs. Findings will be used to inform the development and implementation of a subsequent effectiveness trial, should this be feasible. Trial registration This trial has been prospectively registered on ISRCTN (date 4th January 2023), registration number 17438192, and can be viewed here: https://www.isrctn.com/ISRCTN17438192.
... The stimuli were presented to the judges in a randomized order during each rating task [44]. The judges were allowed to play each sample only once, since repetition was not regarded as necessary, given that each sample already consisted of three repetitions of the same stimulus. ...
... Once judges rated a sample and moved to the next one, they could not go back to the previous one. Breaks of one and a half minutes were included at the end of each rating task to avoid attentional and hearing fatigue [35,44]. Appendices A, B, C, and D provide a summary of the spectral features of the stimuli included in each rating task. ...
... Thus, the internal parameters of the experienced subjects were likely shaped according to their type of experience and training over the years. A clinician whose professional experience is most commonly related to neurological disorders likely has different internal reference standards from another speech-language pathologist more experienced in enhancing the voice of singers, for example (11,57). In addition, a professional whose training was more focused on phonetic evaluation of vocal quality may not have the same standards as a clinician whose training primarily focused on the VAS, and vice versa (11,16,22,57). All these factors may be possible sources of variability among the listeners. ...
... This decreased interrater reliability can be explained by the single stimulus incorporating different types of speech tasks (sustained vowels and connected speech), which causes variability in the perceptual parameters to be extracted from different speech samples. Hence, the higher the number of parameters of analysis is, the more difficult the rating will be, and the more auditory skills will be required (1,19,45,57). In addition, the rater may focus on a specific segment when analyzing the voice recording, on either the vowel or the connected speech, thus increasing the variation of the data from the auditory-perceptual analysis (13). ...
Article
Purpose To assess the influence of listener experience, measurement scales and the type of speech task on the auditory-perceptual evaluation of the overall severity (OS) of voice deviation and the predominant type of voice (rough, breathy or strained). Methods 22 listeners, divided into four groups, participated in the study: speech-language pathologists specialized in voice (SLP-V), SLPs not specialized in voice (SLP-NV), graduate students with auditory-perceptual analysis training (GS-T), and graduate students without auditory-perceptual analysis training (GS-U). The subjects rated the OS of voice deviation and the predominant type of voice of 44 voices using a visual analog scale (VAS) and a numerical scale (score “G” from GRBAS), corresponding to six speech tasks: sustained vowels /a/ and /ɛ/, sentences, number counting, running speech, and all five previous tasks together. Results Sentences obtained the best interrater reliability in each group, using both VAS and GRBAS. The SLP-NV group demonstrated the best interrater reliability in OS judgment in different speech tasks using VAS or GRBAS. Sustained vowels (/a/ and /ɛ/) and running speech obtained the best interrater reliability among the groups of listeners in judging the predominant vocal quality. The GS-T group obtained the best interrater reliability in judging the predominant vocal quality. Conclusion The time of experience in the auditory-perceptual judgment of the voice, the type of training to which the listeners were submitted, and the type of speech task influence the reliability of the auditory-perceptual evaluation of vocal quality. Keywords: Voice; Auditory-perceptual Analysis; Severity of voice Disorder; Vocal Quality; Voice Disorders; Reliability
... The acoustic features used in the present study and their physiological and perceptual correlates have previously been described in detail in comprehensive reviews [16], [38], [39]. The fundamental frequency (f0) is the lowest frequency of a periodic waveform and is perceived as the pitch of a voice [40]. ...
Article
Neurodegenerative disease often affects speech. Speech acoustics can be used as objective clinical markers of pathology. Previous investigations of pathological speech have primarily compared controls with one specific condition and excluded comorbidities. We broaden the utility of speech markers by examining how multiple acoustic features can delineate diseases. We used supervised machine learning with gradient boosting (CatBoost) to delineate healthy speech from speech of people with multiple sclerosis or Friedreich ataxia. Participants performed a diadochokinetic task where they repeated alternating syllables. We subjected 74 spectral and temporal prosodic features from the speech recordings to machine learning. Results showed that Friedreich ataxia, multiple sclerosis and healthy controls were all identified with high accuracy (over 82%). Twenty-one acoustic features were strong markers of neurodegenerative diseases, falling under the categories of spectral qualia, spectral power, and speech rate. We demonstrated that speech markers can delineate neurodegenerative diseases and distinguish healthy speech from pathological speech with high accuracy. Findings emphasize the importance of examining speech outcomes when assessing indicators of neurodegenerative disease. We propose large-scale initiatives to broaden the scope for differentiating other neurological diseases and affective disorders.
... However, in less than half of the ratings all three raters agreed, which preferably could have been higher. It is well known that raters have their own internal standards (Kreiman et al., 1993; Keuning et al., 2004). In this trial the SLP/Ts were all experienced but came from five different countries with different concepts of assessment. ...
Article
Background & aim: To assess consonant proficiency and velopharyngeal function in 10-year-old children born with unilateral cleft lip and palate (UCLP) within the Scandcleft project. Methods & procedures: Three parallel group, randomized, clinical trials were undertaken as an international multicentre study by nine cleft teams in five countries. Three different surgical protocols for primary palate repair (Arm B-Lip and soft palate closure at 3-4 months, hard palate closure at 36 months, Arm C-Lip closure at 3-4 months, hard and soft palate closure at 12 months, and Arm D-Lip closure at 3-4 months combined with a single-layer closure of the hard palate using a vomer flap, soft palate closure at 12 months) were tested against a common procedure (Arm A-Lip and soft palate closure at 3-4 months followed by hard palate closure at 12 months) in the total cohort of 431 children born with a non-syndromic UCLP. Speech audio and video recordings of 399 children were available and perceptually analysed. Percentage of consonants correct (PCC) from a naming test, an overall rating of velopharyngeal competence (VPC) (VPC-Rate), and a composite measure (VPC-Sum) were reported. Outcomes & results: The mean levels of consonant proficiency (PCC score) in the trial arms were 86-92% and between 58% and 83% of the children had VPC (VPC-Sum). Only 50-73% of the participants had a consonant proficiency level with their peers. Girls performed better throughout. Long delay of the hard palate repair (Arm B) indicated lower PCC and simultaneous hard and soft palate closure higher (Arm C). However, the proportion of participants with primary VPC (not including velopharyngeal surgeries) was highest in Arm B (68%) and lowest in Arm C (47%). Conclusions & implications: The speech outcome in terms of PCC and VPC was low across the trials. The different protocols had their pros and cons and there is no obvious evidence to recommend any of the protocols as superior. 
Aspects other than primary surgical method, such as time after velopharyngeal surgery, surgical experience, hearing level, language difficulties and speech therapy, need to be thoroughly reviewed for a better understanding of what has affected speech outcome at 10 years. What this paper adds: What is already known on the subject Speech outcomes at 10 years of age in children treated for UCLP are sparse and contradictory. Previous studies have examined speech outcomes and the relationship with surgical intervention in 5-year-olds. What this study adds to the existing knowledge Speech outcomes based on standardized assessment in a large group of 10-year-old children born with UCLP and surgically treated according to different protocols are presented. While speech therapy had been provided, a large proportion of the children across treatment protocols still needed further speech therapy. What are the potential or actual clinical implications of this work? Aspects other than surgery and speech function might add to the understanding of what affects speech outcome. Effective speech therapy should be available for children in addition to primary surgical repair of the cleft and secondary surgeries if needed.
... However, highly experienced listeners (otolaryngologists and speech therapists) have to be involved to provide a reliable perceptual evaluation. Furthermore, the score differentiation is a difficult task when only pathological voices are examined, as in the case of the substitution voices of PLPs [7]. ...
Article
This paper deals with the analysis of substitution voices in patients who underwent partial laryngectomy for laryngeal cancer, with the aim of identifying a reliable methodology to provide an objective evaluation of post-intervention phonatory impairment and of the effectiveness of rehabilitation therapies. The investigated data-set includes 85 patients who underwent type I Open Partial Horizontal Laryngectomy (22 subjects), type II OPHL (32 subjects) and type III OPHL (31 subjects). The available vocal material (reading task and sustained vowel) was pre-processed in order to remove non-harmonic frames from the patients’ records using two different algorithms. After this preliminary step, a series of features that belong to time, spectral and cepstral domains were extracted from the selected harmonic frames. Then, two different comparisons were made between the classes OPHL-I vs OPHL-II+III and the classes OPHL-II+III (I < 5) vs OPHL-II+III (I ≥ 5), where the index I (Intelligibility) of the auditory perceptual scale INFVo was assessed during a preliminary evaluation. Two different feature-selection techniques, which are based on the comparison among the probability distributions of the extracted features and the classification performance of a logistic regression model, identified the features with the best discrimination capabilities, which are harmonic-to-noise ratio, fundamental frequency, spectral kurtosis, spectral entropy and mel-frequency cepstral coefficients. The best classification accuracy of 96.5% (5-fold cross validation) was obtained in the comparison OPHL-I vs OPHL-II+III using a logistic regression model that was trained using the 5th and 95th percentiles of the fundamental frequency and the 95th percentile of the spectral entropy extracted from the reading task.
... Another major drawback of subjective evaluation is inter- and intra-listener variability in voice perception by a jury of experts. This variability can be influenced by the context, emotional state or attention of the listener [6]. ...
... The LSTM ANN was implemented for pathological voice detection for the first time. Other machine learning tools such as support vector machines and artificial neural networks were already used in similar work [6] but for other pathologies. The application of LSTM recurrent neural networks, in an automatic classification (detection) system of pathological voice, allowed us to have appreciable results. ...
Article
In this study, we propose a method based on Recurrent Neural Networks to objectively evaluate the process of rehabilitation of the pathological voice in an Algerian clinical environment. We chose Unilateral Laryngeal Paralysis as the voice pathology. In this paper, we used a Deep Learning system for pathological voice detection based on the Long Short Term Memory neural model (LSTM). As the dysphonia studied in our work essentially concerns the laryngeal vibration, we chose acoustic parameters based on the instability of the frequency and the amplitude of the laryngeal vibration: Jitter and Shimmer, Noise parameters and MFCCs (Mel Frequency Cepstral Coefficients). A pathological voice detection rate of 88.65% shows important results brought by the rehabilitation technique adopted in the Algerian clinical setting. The exclusive and abusive use of hearing to evaluate the effect of speech rehabilitation in the Algerian hospital environment remains insufficient. It is important to correlate perceptual data with objective methods based on detection and classification methods by introducing relevant acoustic parameters, for an effective and objective management of vocal pathology assessment.
... Furthermore, to help establish the precise role that assigning (non-)nativeness plays in speaker evaluation processes, it is also important to understand whether other speech characteristics, such as a speaker's voice, influence speaker evaluations. A speaker's voice communicates important indexical information, such as a speaker's gender, age, and emotional state with voice having the ability to express approximately 24 emotions (e.g., Majid, 2012), regardless of the language used (Cowen, Elfenbein, Laukka & Keltner 2019; Pell, Monetta, Paulmann & Kotz 2009; Kreiman, Gerratt, Kempster, Erman & Berke 1993). Moreover, an individual's voice can vary on the basis of cultural background and communication context, but voice use also changes on the basis of relationships people have with specific people and groups. ...
... Listeners bring their own states and traits to the task of perceiving a stimulus and assigning a rating to it, such as hearing status, familiarity with the speaker, and experience with dysphonic voices. 28,29 Listeners in an auditory-perceptual task are the tools of measurement; to provide valid outcomes, their judgments should be calibrated as closely as possible within the constraints of these individual listener differences. To interpret auditory-perceptual data, it is critical to report reliability and agreement within and among raters. ...
... As described by Kreiman and colleagues, "Ratings are reliable when the relationship of one rated voice to another is constant (i.e., when voice ratings are parallel or correlated), although the absolute rating may differ from listener to listener (p.36)." 28 High interrater reliability (i.e., r > .95) has been reported for both experienced and inexperienced raters judging overall severity of dysphonia; 30,31 however, these figures are based on Cronbach's alpha or average-measures intraclass correlations (ICC), which measure the association between each rating and the mean of every other rating for a given dimension. Kreiman et al. 28 refer to this as the reliability of the average rating; in Shrout & Fleiss terminology, this is ICC model (2, k). ...
Article
Objectives: The CAPE-V is a widely used protocol developed to help standardize the evaluation of voice. Variability of voice quality ratings has prevented development of training protocols that might themselves improve interrater agreement among new clinicians. As part of a larger mixed methods project, this study examines agreement and reliability for experienced clinicians using the CAPE-V scales. Study design: Observational. Methods: Experienced voice clinicians (N=20) provided ratings of recordings from 12 speakers representing a range of overall voice quality. Participants were instructed to rate the voices as they normally would, using the CAPE-V scales. Descriptive data were recorded and two levels of agreement were calculated. Single rater reliability was calculated using a 2-way random model of absolute agreement for intraclass correlations (ICC [2,1]). Results: Participants' use of the CAPE-V scales varied considerably, although most rated overall severity, breathiness, roughness and strain. Data from one participant did not meet a priori agreement criteria. Because outcomes were significantly different without their data, agreement and reliability were analyzed based on the reduced data set from 19 participants. Interrater agreement and reliability were comparable to previous research; the mean range of ratings was at least 47 mm for all dimensions of voice quality. Conclusions: Results indicated differential use of the components of the CAPE-V form and scales in evaluating voice quality and severity of dysphonia, including categorical variability among ratings of all of the primary CAPE-V dimensions of voice quality that may complicate the clinical description of a voice as mildly, moderately or severely dysphonic.
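The single-rater ICC(2,1) named in this abstract, and the average-rating ICC(2,k) mentioned in the citing text, can both be derived from the mean squares of a two-way ANOVA without replication. The sketch below follows the Shrout & Fleiss formulas for the two-way random, absolute-agreement model as generally stated; it is an illustration, not the authors' analysis code. The function name `icc2` and the input layout (one row per rated voice, one column per rater, no missing ratings) are assumptions.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete voices-by-raters table.

    `ratings` is a list of rows, one per rated voice; each row holds one
    rating per rater. Two-way random effects, absolute agreement.
    """
    n = len(ratings)          # number of rated voices (targets)
    k = len(ratings[0])       # number of raters (judges)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 rows and columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-voices mean square
    jms = ss_cols / (k - 1)                 # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square

    # Reliability of a single rater's rating.
    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    # Reliability of the mean of the k raters' ratings.
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average
```

Because the average-rating coefficient divides the rater and error variance by the number of raters, it is typically much higher than the single-rater coefficient for the same data, which is one reason reported reliabilities differ so widely between studies.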
... The obtained data are first analyzed for intra-evaluator agreement using exact agreement statistics. [40] The intra-evaluator agreement for I and N is 0.96 and 0.88 respectively, with an average of 0.92. Intra-class correlation coefficients (ICC) from a two-factor analysis of variance without replication were used to analyze inter-evaluator reliability. ...
Article
Objective: The current work aims to design and develop an automatically controlled wearable electrolarynx, a voice substitution device for laryngeal carcinoma survivors. Methods: The physical activity of mouth opening is sensed, amplified, and made to act as an enable signal to trigger the wearable electrolarynx. The resulting speech is recorded and its voice reaction durations are compared with those of manual electrolarynx and normal speaking methods. Perception evaluations of 5 subjects from 10 speech-language therapists are obtained. Results: The wearable electrolarynx turns on within 13 μs once the mouth movement for speech is sensed. The voice initiation and termination durations are 215.68 ms and 231.41 ms, respectively. Results indicate that there is no significant difference (P < 0.05) between the voice reaction durations of wearable electrolarynx and normal speaking methods. The subjective evaluation results show that there is a significant improvement (P < 0.05) in intelligibility and noise reduction when compared to a commercially available electrolarynx, with an average intra-class correlation coefficient of 0.68 from a two-factor analysis of variance without replication. Conclusions: The assessment of the wearable and automatically controlled electrolarynx provides hands-free speech and easy control over the device.