Ten-fold cross-validation. The precision represents the overall precision for all classes. The performance for individual classes is not shown

Source publication

Automated Classification of Radiology Reports to Facilitate Retrospective Study in Radiology

Article

Full-text available

May 2014

Retrospective research is an import tool in radiology. Identifying imaging examinations appropriate for a given research question from the unstructured radiology reports is extremely useful, but labor-intensive. Using the machine learning text-mining methods implemented in LingPipe [1], we evaluated the performance of the dynamic language model (DL...

Context 1

... classification of character sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories [1]. It is an n-gram classification system based on the frequency distributions of sequences of characters with the length of N [2, 3]. The Naïve Bayesian classifier is based on a uniform whitespace language model and an optional n-gram character language model for smoothing unknown tokens [1]. It is essentially a Bag-of-Words document classification technique where the tokens (words) are assumed to be independent of one another [4]. If words in the text to be classified are not present in the training corpus, an n-gram character language model is applied. In the output, both methods calculate the probabilities for the text to be analyzed (the representative sentences of each report in this study) and the class with highest probability is assigned to the report. Validation and Testing First, we tested the performance of the classifiers in classifying the training data itself by running classification of the training dataset using an n-gram of 6, a commonly used n-gram number in text classification. Then, a 10-fold cross-validation approach [5] was used to evaluate the performance of the classification models. Briefly, the training dataset was divided into 10 equal-sized segments. For each fold, nine segments were used to train the classifiers which were then used to classify the remaining segment. After 10 folds, all sentences were classified. The models calculated probabilities of all possible classes for each sentence. The classification assigned to a sentence was the first-best category, the category with the highest probability. The results were compared to the manual classifications, and the performance statistics (precision, accuracy, and recall) were calculated. To evaluate the effects of the n-gram numbers on the performance, we ran the 10-fold validation process with different n-gram numbers from 2 through 8 for both the dynamic language model classifier and the Naïve Bayesian classifiers. A case finder computer program was implemented to search the radiology report database and classify the reports into one of the predefined six classifications (Fig. 1). Briefly, the program runs a keyword search against the report dataset and retrieves all reports that contain the keywords of interest. The reports are then parsed into individual sentences using a language model provided by the natural language processing tool kit, LingPipe [1]. Sentences containing the keywords were extracted and classified using the classification models constructed using the training data described above without further preprocessing. If two or more sentences contain the keyword(s), the sentences are concatenated before being sub- jected to the classifiers, which treat the combined sentences as a whole. The classifications of the sentences are used to represent the classifications of the reports and used to select cases for retrospective research projects. The case finder pro- vides a simple tool to search the entire database, returning a tabular output of a list of reports with classifications and the representative sentences. To test the performance of the program against our report database that had over 5 million radiology reports, we used the keywords “ sellar mass ” , “ suprasellar mass ” , or “ colloid cyst ” to search and retrieve the reports that contained the keywords. The reports were classified using the dynamic language model classifier with an n-gram of 4, as well as manually by radiologists who were without knowledge of the classifications by the case finder program. The results were compared to determine the performance of the program. A total of 14,325 sentences (including 11,430 sentences from 8,537 radiology reports from all disciplines of radiology, and an additional 2,895 sentences from brain CT and MRI reports) were manually classified as one of six predefined classes. The concordance of manual classification by the experts was estimated to be 95.6 %, based on 168 discrepant classifications out of the 3,428 sentences that had been manually classified by at least two experts. The unweighted Cohen ’ s Kappa was 0.94 with 95 % confidence interval (CI) of 0.01. When the training dataset was classified using an n-gram of 6, the accuracies for the dynamic language model (DLM) and the Naïve Bayesian (NB) classifiers were 91.6 % with 95 % CI of 0.46 % and 86.0 % with 95 % CI of 0.46 %, respectively. The confidence intervals were estimated using the binomial distribution [1]. The confusion matrices are listed in the Tables 2 and 3. Ten-fold cross-validations were performed for both the DLM and the NB classifiers using n-gram numbers from 2 to 8. The performance (precision) of the n-gram numbers with each classification method was determined. The quad-grams (n-gram of 4) were found to give the best average performances for the DLM classifier (Fig. 2). For the NB classifier, as expected, the n-gram numbers did not seem to affect the performance, likely due to the large training corpus such that most words are present in the training dataset and only a limited number of words needed n-gram-based smoothing. Overall, the DLM classifier performed slightly better than the NB. When an n-gram of 4 was used in the 10-fold cross- validation analysis, the average accuracies for the DLM and NB classifiers were 88.5 % with 95 % CI of 1.9 % and 85.9 % with 95 % CI of 2.0 %, respectively (Fig. 2). As the results suggested slightly better performance for the DLM classier, we then evaluated the performances of DLM classifiers on individual classes using accuracy, recall, and precision as the performance indicators (Fig. 3). The accuracy of all the classifications exceeded 90 % and showed essentially no difference among the groups. However, the recalls and precisions for the class “ DDx ” were 61.8 % and 71.1 %, respectively, significantly lower than the other categories. A total of 220 sentences manually assigned to the class DDx were classified incorrectly by the machine learning method, 7 as “ Negative ” , 2 as “ Normal ” , 5 as “ PostTx ” , and 186 as “ Positive ” . On the other hand, there were 316 sentences manually classified as “ Positive ” that were incorrectly classified by the machine learning method, among which 3 were assigned to “ Normal ” , 82 to “ Negative ” , and 228 to “ DDx ” . Next, we tested the performance of the DLM classifier trained using the complete training dataset to classify 1,397 reports containing the keywords “ sellar mass or suprasellar mass ” or “ colloid cyst ” . These reports were independently manually classified by radiologists in the same manner as in annotating the training data. When compared to the manual classification, the prediction model produced an overall accuracy of 88.2 % with 95 % CI of 2.1 % for 959 reports that contain “ sellar/suprasellar mass ” , and an overall accuracy ...

View in full-text

Context 2

... the tokens (words) are assumed to be independent of one another [4]. If words in the text to be classified are not present in the training corpus, an n-gram character language model is applied. In the output, both methods calculate the probabilities for the text to be analyzed (the representative sentences of each report in this study) and the class with highest probability is assigned to the report. Validation and Testing First, we tested the performance of the classifiers in classifying the training data itself by running classification of the training dataset using an n-gram of 6, a commonly used n-gram number in text classification. Then, a 10-fold cross-validation approach [5] was used to evaluate the performance of the classification models. Briefly, the training dataset was divided into 10 equal-sized segments. For each fold, nine segments were used to train the classifiers which were then used to classify the remaining segment. After 10 folds, all sentences were classified. The models calculated probabilities of all possible classes for each sentence. The classification assigned to a sentence was the first-best category, the category with the highest probability. The results were compared to the manual classifications, and the performance statistics (precision, accuracy, and recall) were calculated. To evaluate the effects of the n-gram numbers on the performance, we ran the 10-fold validation process with different n-gram numbers from 2 through 8 for both the dynamic language model classifier and the Naïve Bayesian classifiers. A case finder computer program was implemented to search the radiology report database and classify the reports into one of the predefined six classifications (Fig. 1). Briefly, the program runs a keyword search against the report dataset and retrieves all reports that contain the keywords of interest. The reports are then parsed into individual sentences using a language model provided by the natural language processing tool kit, LingPipe [1]. Sentences containing the keywords were extracted and classified using the classification models constructed using the training data described above without further preprocessing. If two or more sentences contain the keyword(s), the sentences are concatenated before being sub- jected to the classifiers, which treat the combined sentences as a whole. The classifications of the sentences are used to represent the classifications of the reports and used to select cases for retrospective research projects. The case finder pro- vides a simple tool to search the entire database, returning a tabular output of a list of reports with classifications and the representative sentences. To test the performance of the program against our report database that had over 5 million radiology reports, we used the keywords “ sellar mass ” , “ suprasellar mass ” , or “ colloid cyst ” to search and retrieve the reports that contained the keywords. The reports were classified using the dynamic language model classifier with an n-gram of 4, as well as manually by radiologists who were without knowledge of the classifications by the case finder program. The results were compared to determine the performance of the program. A total of 14,325 sentences (including 11,430 sentences from 8,537 radiology reports from all disciplines of radiology, and an additional 2,895 sentences from brain CT and MRI reports) were manually classified as one of six predefined classes. The concordance of manual classification by the experts was estimated to be 95.6 %, based on 168 discrepant classifications out of the 3,428 sentences that had been manually classified by at least two experts. The unweighted Cohen ’ s Kappa was 0.94 with 95 % confidence interval (CI) of 0.01. When the training dataset was classified using an n-gram of 6, the accuracies for the dynamic language model (DLM) and the Naïve Bayesian (NB) classifiers were 91.6 % with 95 % CI of 0.46 % and 86.0 % with 95 % CI of 0.46 %, respectively. The confidence intervals were estimated using the binomial distribution [1]. The confusion matrices are listed in the Tables 2 and 3. Ten-fold cross-validations were performed for both the DLM and the NB classifiers using n-gram numbers from 2 to 8. The performance (precision) of the n-gram numbers with each classification method was determined. The quad-grams (n-gram of 4) were found to give the best average performances for the DLM classifier (Fig. 2). For the NB classifier, as expected, the n-gram numbers did not seem to affect the performance, likely due to the large training corpus such that most words are present in the training dataset and only a limited number of words needed n-gram-based smoothing. Overall, the DLM classifier performed slightly better than the NB. When an n-gram of 4 was used in the 10-fold cross- validation analysis, the average accuracies for the DLM and NB classifiers were 88.5 % with 95 % CI of 1.9 % and 85.9 % with 95 % CI of 2.0 %, respectively (Fig. 2). As the results suggested slightly better performance for the DLM classier, we then evaluated the performances of DLM classifiers on individual classes using accuracy, recall, and precision as the performance indicators (Fig. 3). The accuracy of all the classifications exceeded 90 % and showed essentially no difference among the groups. However, the recalls and precisions for the class “ DDx ” were 61.8 % and 71.1 %, respectively, significantly lower than the other categories. A total of 220 sentences manually assigned to the class DDx were classified incorrectly by the machine learning method, 7 as “ Negative ” , 2 as “ Normal ” , 5 as “ PostTx ” , and 186 as “ Positive ” . On the other hand, there were 316 sentences manually classified as “ Positive ” that were incorrectly classified by the machine learning method, among which 3 were assigned to “ Normal ” , 82 to “ Negative ” , and 228 to “ DDx ” . Next, we tested the performance of the DLM classifier trained using the complete training dataset to classify 1,397 reports containing the keywords “ sellar mass or suprasellar mass ” or “ colloid cyst ” . These reports were independently manually classified by radiologists in the same manner as in annotating the training data. When compared to the manual classification, the prediction model produced an overall accuracy of 88.2 % with 95 % CI of 2.1 % for 959 reports that contain “ sellar/suprasellar mass ” , and an overall accuracy ...

View in full-text

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Article

Full-text available

Jun 2014

Francesco Orabona

Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance, thanks to their scalability. While various methods have been proposed to speed up their convergence, the model selection phase is often ignored. In fact, in theoretical works most of the time assumptions are made, for example, on t...

Development and validation of deep learning and BERT models for classification of lung cancer radiology reports

Article

Full-text available

Jun 2023

Purpose: Manual cohort building from radiology reports can be tedious. Natural Language Processing (NLP) can be used for automated cohort building. In this study, we have developed and validated an NLP approach based on deep learning (DL) to select lung cancer reports from a thoracic disease management group cohort. Materials and methods: 4064 radiology reports (CT and PET/CT) of a thoracic disease management group reported between 2014 and 2016 were used. These reports were anonymised, cleaned, text normalized and split into a training, testing, and validation set. External validation was performed on radiology reports from the MIMIC-III clinical database. We used three DL models, namely, Bi-LSTM_simple, Bi-LSTM_dropout, and Pre-trained _BERT model to predict if a report concerned lung cancer. We studied the effect of minority oversampling on all models. Results: Without oversampling, the F1 scores at 95% CI for Bi-LSTM_simple, Bi-LSTM_dropout and BERT were 0.89, 0.90, and 0.86; with oversampling, the F1 scores were 0.94, 0.94, and 0.9, on internal validation. On external validation the F1-scores of Bi-LSTM_simple, Bi-LSTM_dropout and BERT models were 0.63, 0.77 and 0.80 without oversampling and 0.72, 0.78 and 0.77 with oversampling. Conclusion: Pre-trained BERT model and Bi-LSTM_dropout models to predict a lung cancer report showed consistent performance on internal and external validation with the BERT model exhibiting superior performance. The overall F1 score decreased on external validation for both Bi-LSTM models with the Bi-LSTM_simple model showing a more significant drop. All models showed some improvement on minority oversampling.

Clinical Concept-Based Radiology Reports Classification Pipeline for Lung Carcinoma

Article

Full-text available

Feb 2023
J DIGIT IMAGING

Rising incidence and mortality of cancer have led to an incremental amount of research in the field. To learn from preexisting data, it has become important to capture maximum information related to disease type, stage, treatment, and outcomes. Medical imaging reports are rich in this kind of information but are only present as free text. The extraction of information from such unstructured text reports is labor-intensive. The use of Natural Language Processing (NLP) tools to extract information from radiology reports can make it less time-consuming as well as more effective. In this study, we have developed and compared different models for the classification of lung carcinoma reports using clinical concepts. This study was approved by the institutional ethics committee as a retrospective study with a waiver of informed consent. A clinical concept-based classification pipeline for lung carcinoma radiology reports was developed using rule-based as well as machine learning models and compared. The machine learning models used were XGBoost and two more deep learning model architectures with bidirectional long short-term neural networks. A corpus consisting of 1700 radiology reports including computed tomography (CT) and positron emission tomography/computed tomography (PET/CT) reports were used for development and testing. Five hundred one radiology reports from MIMIC-III Clinical Database version 1.4 was used for external validation. The pipeline achieved an overall F1 score of 0.94 on the internal set and 0.74 on external validation with the rule-based algorithm using expert input giving the best performance. Among the machine learning models, the Bi-LSTM_dropout model performed better than the ML model using XGBoost and the Bi-LSTM_simple model on internal set, whereas on external validation, the Bi-LSTM_simple model performed relatively better than other 2. This pipeline can be used for clinical concept-based classification of radiology reports related to lung carcinoma from a huge corpus and also for automated annotation of these reports. Supplementary Information The online version contains supplementary material available at 10.1007/s10278-023-00787-z.

Natural Language Processing and Graph Theory: Making Sense of Imaging Records in a Novel Representation Frame

Article

Full-text available

Dec 2022

Background: A concise visualization framework of related reports would increase readability and improve patient management. To this end, temporal referrals to prior comparative exams are an essential connection to previous exams in written reports. Due to unstructured narrative texts' variable structure and content, their extraction is hampered by poor computer readability. Natural language processing (NLP) permits the extraction of structured information from unstructured texts automatically and can serve as an essential input for such a novel visualization framework. Objective: This study proposes and evaluates an NLP-based algorithm capable of extracting the temporal referrals in written radiology reports, applies it to all the radiology reports generated for 10 years, introduces a graphical representation of imaging reports, and investigates its benefits for clinical and research purposes. Methods: In this single-center, university hospital, retrospective study, we developed a convolutional neural network capable of extracting the date of referrals from imaging reports. The model's performance was assessed by calculating precision, recall, and F1-score using an independent test set of 149 reports. Next, the algorithm was applied to our department's radiology reports generated from 2011 to 2021. Finally, the reports and their metadata were represented in a modulable graph. Results: For extracting the date of referrals, the named-entity recognition (NER) model had a high precision of 0.93, a recall of 0.95, and an F1-score of 0.94. A total of 1,684,635 reports were included in the analysis. Temporal reference was mentioned in 53.3% (656,852/1,684,635), explicitly stated as not available in 21.0% (258,386/1,684,635), and omitted in 25.7% (317,059/1,684,635) of the reports. Imaging records can be visualized in a directed and modulable graph, in which the referring links represent the connecting arrows. Conclusions: Automatically extracting the date of referrals from unstructured radiology reports using deep learning NLP algorithms is feasible. Graphs refined the selection of distinct pathology pathways, facilitated the revelation of missing comparisons, and enabled the query of specific referring exam sequences. Further work is needed to evaluate its benefits in clinics, research, and resource planning.

Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets

Article

Jun 2022

Purpose: To develop and evaluate domain-specific and pretrained bidirectional encoder representations from transformers (BERT) models in a transfer learning task on varying training dataset sizes to annotate a larger overall dataset. Materials and methods: The authors retrospectively reviewed 69 095 anonymized adult chest radiograph reports (reports dated April 2020-March 2021). From the overall cohort, 1004 reports were randomly selected and labeled for the presence or absence of each of the following devices: endotracheal tube (ETT), enterogastric tube (NGT, or Dobhoff tube), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Pretrained transformer models (BERT, PubMedBERT, DistilBERT, RoBERTa, and DeBERTa) were trained, validated, and tested on 60%, 20%, and 20%, respectively, of these reports through fivefold cross-validation. Additional training involved varying dataset sizes with 5%, 10%, 15%, 20%, and 40% of the 1004 reports. The best-performing epochs were used to assess area under the receiver operating characteristic curve (AUC) and determine run time on the overall dataset. Results: The highest average AUCs from fivefold cross-validation were 0.996 for ETT (RoBERTa), 0.994 for NGT (RoBERTa), 0.991 for CVC (PubMedBERT), and 0.98 for SGC (PubMedBERT). DeBERTa demonstrated the highest AUC for each support device trained on 5% of the training set. PubMedBERT showed a higher AUC with a decreasing training set size compared with BERT. Training and validation time was shortest for DistilBERT at 3 minutes 39 seconds on the annotated cohort. Conclusion: Pretrained and domain-specific transformer models required small training datasets and short training times to create a highly accurate final model that expedites autonomous annotation of large datasets.Keywords: Informatics, Named Entity Recognition, Transfer Learning Supplemental material is available for this article. ©RSNA, 2022See also the commentary by Zech in this issue.

Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke

Article

Full-text available

Feb 2019
PLOS ONE

Background and purpose: This project assessed performance of natural language processing (NLP) and machine learning (ML) algorithms for classification of brain MRI radiology reports into acute ischemic stroke (AIS) and non-AIS phenotypes. Materials and methods: All brain MRI reports from a single academic institution over a two year period were randomly divided into 2 groups for ML: training (70%) and testing (30%). Using "quanteda" NLP package, all text data were parsed into tokens to create the data frequency matrix. Ten-fold cross-validation was applied for bias correction of the training set. Labeling for AIS was performed manually, identifying clinical notes. We applied binary logistic regression, naïve Bayesian classification, single decision tree, and support vector machine for the binary classifiers, and we assessed performance of the algorithms by F1-measure. We also assessed how n-grams or term frequency-inverse document frequency weighting affected the performance of the algorithms. Results: Of all 3,204 brain MRI documents, 432 (14.3%) were labeled as AIS. AIS documents were longer in character length than those of non-AIS (median [interquartile range]; 551 [377-681] vs. 309 [164-396]). Of all ML algorithms, single decision tree had the highest F1-measure (93.2) and accuracy (98.0%). Adding bigrams to the ML model improved F1-mesaure of naïve Bayesian classification, but not in others, and term frequency-inverse document frequency weighting to data frequency matrix did not show any additional performance improvements. Conclusions: Supervised ML based NLP algorithms are useful for automatic classification of brain MRI reports for identification of AIS patients. Single decision tree was the best classifier to identify brain MRI reports with AIS.

Classification of radiology reports by modality and anatomy: A comparative study

Preprint

Full-text available

Dec 2018

Data labeling is currently a time-consuming task that often requires expert knowledge. In research settings, the availability of correctly labeled data is crucial to ensure that model predictions are accurate and useful. We propose relatively simple machine learning-based models that achieve high performance metrics in the binary and multiclass classification of radiology reports. We compare the performance of these algorithms to that of a data-driven approach based on NLP, and find that the logistic regression classifier outperforms all other models, in both the binary and multiclass classification tasks. We then choose the logistic regression binary classifier to predict chest X-ray (CXR)/ non-chest X-ray (non-CXR) labels in reports from different datasets, unseen during any training phase of any of the models. Even in unseen report collections, the binary logistic regression classifier achieves average precision values of above 0.9. Based on the regression coefficient values, we also identify frequent tokens in CXR and non-CXR reports that are features with possibly high predictive power.

Use of Machine Learning to Identify Follow-Up Recommendations in Radiology Reports

Article

Full-text available

Dec 2018
J Am Coll Radiol

Purpose: The aims of this study were to assess follow-up recommendations in radiology reports, develop and assess traditional machine learning (TML) and deep learning (DL) models in identifying follow-up, and benchmark them against a natural language processing (NLP) system. Methods: This HIPAA-compliant, institutional review board-approved study was performed at an academic medical center generating >500,000 radiology reports annually. One thousand randomly selected ultrasound, radiography, CT, and MRI reports generated in 2016 were manually reviewed and annotated for follow-up recommendations. TML (support vector machines, random forest, logistic regression) and DL (recurrent neural nets) algorithms were constructed and trained on 850 reports (training data), with subsequent optimization of model architectures and parameters. Precision, recall, and F1 score were calculated on the remaining 150 reports (test data). A previously developed and validated NLP system (iSCOUT) was also applied to the test data, with equivalent metrics calculated. Results: Follow-up recommendations were present in 12.7% of reports. The TML algorithms achieved F1 scores of 0.75 (random forest), 0.83 (logistic regression), and 0.85 (support vector machine) on the test data. DL recurrent neural nets had an F1 score of 0.71; iSCOUT also had an F1 score of 0.71. Performance of both TML and DL methods by F1 scores appeared to plateau after 500 to 700 samples while training. Conclusions: TML and DL are feasible methods to identify follow-up recommendations. These methods have great potential for near real-time monitoring of follow-up recommendations in radiology reports.

Patient Records Retrieval System for Integrated Care in Treatment of Cervical Spine Defect

Conference Paper

Full-text available

Apr 2017
Lect Notes Comput Sci

In clinical decision making, information on the treatment of patients that show similar medical conditions and symptoms to the current case, is one of most relevant information sources to create a good, evidence-based treatment plan. However, the retrieval of similar cases is still challenging and automatic support is missing. The reasons are two-fold: First, the query formulation is difficult since multiple criteria need to be selected and specified in short query phrases. Second, the discrete storage of multimedia patient records makes the retrieval and summary of a patient history extremely difficult. In this paper, we present a retrieval system for electronic health records (EHR). More specifically, a retrieval platform for EHRs for supporting clinical decision making in treatment of cervical spine defects with the information extracted from textual data of patient records is implemented as prototype. The patient cases are classified according to cervical spine defect classes, while the classification relies upon rules obtained from the corresponding defect classification schema and guidelines. In a retrospective study, the classifier is applied to clinical documents and the classification results are evaluated.

Natural Language Processing in Radiology: A Systematic Review

Article

Apr 2016
RADIOLOGY

Radiological reporting has generated large quantities of digital content within the electronic health record, which is potentially a valuable source of information for improving clinical care and supporting research. Although radiology reports are stored for communication and documentation of diagnostic imaging, harnessing their potential requires efficient and automated information extraction: they exist mainly as free-text clinical narrative, from which it is a major challenge to obtain structured data. Natural language processing (NLP) provides techniques that aid the conversion of text into a structured representation, and thus enables computers to derive meaning from human (ie, natural language) input. Used on radiology reports, NLP techniques enable automatic identification and extraction of information. By exploring the various purposes for their use, this review examines how radiology benefits from NLP. A systematic literature search identified 67 relevant publications describing NLP methods that support practical applications in radiology. This review takes a close look at the individual studies in terms of tasks (ie, the extracted information), the NLP methodology and tools used, and their application purpose and performance results. Additionally, limitations, future challenges, and requirements for advancing NLP in radiology will be discussed. http://pubs.rsna.org/doi/abs/10.1148/radiol.16142770 © RSNA, 2016

A system for feature extraction and classification of ovarian CT radiology reports

Conference Paper

Dec 2023

Ten-fold cross-validation. The precision represents the overall precision for all classes. The performance for individual classes is not shown

Contexts in source publication

Similar publications

Citations