
Med7: a transferable clinical natural language processing model for electronic health records

March 2020


Authors:

Andrey Kormilitzin

University of Oxford

Nemanja Vaci

The University of Sheffield

Alejo Nevado-Holgado

University of Oxford


Med7: a transferable clinical natural language processing model for electronic health records

Andrey Kormilitzin (a; corresponding author, footnote 1), Nemanja Vaci (a), Qiang Liu (a), Alejo Nevado-Holgado (a)

(a) Department of Psychiatry, Warneford Hospital, Oxford, OX3 7JX, UK

Abstract

The field of clinical natural language processing has advanced significantly since the introduction of deep learning models. Self-supervised representation learning and the transfer learning paradigm have become the methods of choice in many natural language processing applications, in particular in settings with a dearth of high-quality manually annotated data. Electronic health record systems are ubiquitous and the majority of patients' data are now collected electronically, in particular in the form of free text. Identification of medical concepts and information extraction is a challenging task, yet an important ingredient for parsing unstructured data into a structured and tabulated format for downstream analytical tasks. In this work we introduce a named-entity recognition model for clinical natural language processing. The model is trained to recognise seven categories: drug names, route, frequency, dosage, strength, form and duration. The model was first pre-trained in a self-supervised manner by predicting the next word, using a collection of 2 million free-text patients' records from the MIMIC-III corpus, and then fine-tuned on the named-entity recognition task. The model achieved a lenient (strict) micro-averaged F1 score of 0.957 (0.893) across all seven categories. Additionally, we evaluated the transferability of the developed model from Intensive Care Unit data in the US to secondary care mental health records (CRIS) in the UK. A direct application of the trained NER model to CRIS data resulted in a reduced performance of F1=0.762; however, after fine-tuning on a small sample from CRIS, the model achieved a reasonable performance of F1=0.944. This demonstrates that, despite a close similarity between the data sets and the NER tasks, it is essential to fine-tune on the target domain data in order to achieve more accurate results.

Keywords: clinical natural language processing, neural networks, self-supervised learning, noisy labelling, active learning

1. Introduction

Recent years have seen remarkable technological advances in digital platforms, and in medicine and healthcare in particular. The majority of patients' medical records are now collected electronically and represent unparalleled opportunities for research, for delivering better health care and for improving patients' outcomes. However, a substantial amount of patients' information is contained in free-text form, as summarised by clinicians, nurses and care givers through interviews and assessments. Free-text medical records normally contain very rich information about a patient's history: because it is expressed in natural language, it can reflect nuanced details. However, free-text records are harder to use than structured, ready-to-use data sources. Manual processing of all patients' free-text records severely limits the utilisation of unstructured data and makes data mining extremely expensive. Machine learning algorithms, on the other hand, are well poised to process large amounts of data, spot unusual interactions and extract meaningful information. Recent lines of research in the field of natural language processing (NLP), such as deep contextualised word representations [1], Transformer-based architectures [2] and large language models [3], offer new opportunities for clinical natural language processing using unstructured medical records [4].

Footnote 1: Corresponding author: andrey.kormilitzin@psych.ox.ac.uk

Preprint submitted to Elsevier, March 4, 2020

arXiv:2003.01271v1 [cs.CL] 3 Mar 2020

Identification of concepts of interest in free text is a sub-task of information extraction (IE), more commonly known as named-entity recognition (NER), which seeks to classify words into pre-defined categories [5] and assign labels to them. A robust and accurate NER model for the identification of medical concepts, such as drug names, strength, frequency of administration, reported symptoms, diagnoses, health scores and many more, is an essential and foundational component of any clinical IE system.

2. Related work

The topic of clinical natural language processing and information extraction has been actively researched over the past years, in particular with the introduction and adoption of electronic health record platforms. The methods have evolved from simple logic and rule-based systems to complex deep learning architectures [6, 7]. One common approach to information extraction is to transform free-text data into a coded representation via lookup tables, such as the Unified Medical Language System (UMLS) [8] or SNOMED CT, a structured clinical vocabulary for use in electronic health records. Some rule-based systems used semantic lexicons with more complex linguistic features to identify concepts in biomedical literature [9]. With advances in machine learning algorithms, methods such as hidden Markov models and conditional random fields [10] were used to label entities for the NER task. In the last decade, deep learning methods have played an essential role in developing more capable models for natural language processing, in particular in the biomedical domain. Word embeddings [11, 12] were introduced as numerical representations of textual data and were used as input layers to deep neural networks. For a comprehensive review of word embeddings for clinical applications, please refer to [13]. More recently, unsupervised model pre-training on a large collection of unlabelled data, followed by fine-tuning on a downstream task, has taken off and demonstrated its high potential [14]. Since the introduction of Transformer-based deep neural network architectures such as BERT [3], RoBERTa [15], XLNet [16] and others, the transfer learning approach of reusing pre-trained models has become the method of choice for the majority of NLP tasks. Some notable examples of pre-trained deep learning models for biomedical natural language processing are BioBERT [17] for text mining and ClinicalBERT [18, 19] for contextual word representations fine-tuned on electronic health records and for predicting hospital readmission. An open source Python library, 'scispaCy' [20], was also recently introduced for biomedical natural language processing. In this work we developed an open source named-entity recognition model dedicated to the identification of seven categories related to medications mentioned in free-text electronic patient records.

3. Materials and Methods

3.1. Data

The annotated data set was sourced from the MIMIC-III (Medical Information Mart for Intensive Care-III) electronic health records database [21] as part of Track 2 of the 2018 National NLP Clinical Challenges (n2c2) Shared Task on the extraction of drug-related concepts, including adverse drug events (ADE) and reasons for prescription [22]. The data set comprised a collection of discharge letters from the Intensive Care Unit (ICU) and contained very rich and detailed information about medications used for treatment. The organisers randomly split the data set into training and test sets of 303 and 202 documents respectively. The documents were annotated for nine categories: ADE, Dosage, Drug, Duration, Form, Frequency, Reason, Route and Strength. For the purpose of the current work we considered only the seven drug-related categories and discarded the other two, ADE and Reason. We aimed to develop a model for extracting medications and their related information that will be beneficial to the biomedical community and can be used robustly in a variety of downstream natural language processing tasks on free-text medical records. The description of the data sets and the annotation statistics are summarised in Table 1.

Types of annotated entities      Train     Test     Total
Dosage                            4227     2681      6908
Drug                             16257    10575     26832
Duration                           592      378       970
Form                              6657     4359     11016
Frequency                         6281     4012     10293
Route                             5460     3513      8973
Strength                          6694     4230     10924
Number of documents                303      202       505
Total number of words           957972   627771   1585743
Total number of unique words     27602    21729     35763

Table 1: Distribution of gold-annotated entities and text summary statistics of the training and test data sets. The number of unique tokens is computed after lowercasing words.

In addition to the MIMIC-III and 2018 n2c2 data sets, we evaluated the developed model on electronic medical records sourced from the Clinical Record Interactive Search (UK-CRIS) platform, which is the largest secondary care mental health database in the United Kingdom. UK-CRIS contains more than 500 million clinical notes from 2.7 million de-identified patient records from 12 National Health Service (NHS) Network Partners across the UK (footnote 2).

3.2. Methods

3.2.1. Text pre-processing

In order to compare the performance of the developed medication extraction model on MIMIC-III (n2c2 2018) and UK-CRIS data, basic text cleaning and pre-processing steps were taken to standardise the texts. UK-CRIS notes that had been uploaded as scanned documents and converted into electronic text via optical character recognition (OCR) were cleaned of artefacts such as email addresses, non-ASCII characters, website URLs and HTML or XML tags. Additionally, standard escape sequences ('\t', '\n' and '\r') were removed and the offsets of gold-annotated entities were adjusted accordingly.

Footnote 2: https://crisnetwork.co
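The cleaning and offset adjustment described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions, the function name `clean_text` and the example sentence are assumptions, since the exact rules applied to UK-CRIS are not given in the text.

```python
import re

# Hypothetical artefact patterns (assumed for illustration only).
ARTEFACTS = re.compile(
    r"<[^>]+>"                 # HTML or XML tags
    r"|https?://\S+|www\.\S+"  # website URLs
    r"|\S+@\S+\.\S+"           # email addresses
    r"|[\t\n\r]"               # standard escape sequences
    r"|[^\x00-\x7F]"           # non-ASCII characters
)

def clean_text(text, spans):
    """Remove artefacts and shift (start, end, label) spans to the cleaned text."""
    keep = [True] * len(text)
    for match in ARTEFACTS.finditer(text):
        for i in range(match.start(), match.end()):
            keep[i] = False
    # new_index[i] = number of kept characters before original position i
    new_index, kept = [], 0
    for flag in keep:
        new_index.append(kept)
        kept += flag
    new_index.append(kept)
    cleaned = "".join(ch for ch, flag in zip(text, keep) if flag)
    new_spans = [(new_index[s], new_index[e], label) for s, e, label in spans]
    return cleaned, new_spans

text = "Start: \taspirin 81 mg <b>daily</b>\n"
cleaned, spans = clean_text(text, [(8, 15, "Drug")])
print(repr(cleaned), spans)
```

Keeping a per-character index map means every gold offset can be translated in one pass, regardless of how many artefacts precede it.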

3.2.2. Self-supervised learning

The main obstacle to developing an accurate information extraction model is the dearth of high-quality annotated data to train the model. In contrast to the large, publicly available, manually annotated data sets for computer vision [23, 24] and for various downstream natural language processing tasks [25, 26, 27], manually annotated texts for clinical concept extraction are quite rare [22]. The shortage of annotated clinical data is mainly due to privacy concerns and the potential identification of patients' personal medical information. Several lines of research have addressed the problem of learning from limited annotated data in the clinical domain [28, 29, 30], and pre-training of the underlying language model and word representations generally leads to better performance with less data [14].

In this work, we used spaCy's (footnote 3) implementation of a cloze-style word reconstruction task, similar to the masked language model objective introduced in BERT [3]. However, instead of predicting the exact word identifier from the vocabulary, the model predicted the word's GloVe [12] vector, taken from a static embedding table, with a cosine loss function. The pre-trained language model was then used to initialise the weights of the convolutional neural network layers, rather than starting from random weights. We experimented with various combinations of hyperparameters of the language model, such as the number of rows and the width of the embedding tables and the depth of the convolutional layers.
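As a minimal illustration of the objective described above, the sketch below computes a cosine loss between a predicted vector and a target word vector. The function name and the toy vectors are assumptions made for illustration; spaCy's own implementation may differ in scaling and batching.

```python
import math

def cosine_loss(predicted, target):
    """Return 1 - cosine similarity: 0 for aligned vectors, 2 for opposite ones."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

# Toy 2-dimensional "word vectors" (hypothetical values).
print(cosine_loss([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_loss([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_loss([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the loss depends only on direction, the model is not penalised for the magnitude of the predicted vector, only for pointing away from the target embedding.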

3.2.3. Named entity recognition model

The task of locating concepts of interest in unstructured text and subsequently classifying them into predefined categories, for example drug names, dosages or frequency of administration, is a sub-task of information extraction called named-entity recognition (NER). There are various implementations of NER systems, ranging from rule-based string matching approaches [5] to complex Transformer models [2] and hybrid combinations of the two. In this work the named-entity recognition model for extraction of medication information was implemented in Python 3.7 using the spaCy open source library for NLP tasks [31]. Although a good number of NLP libraries exist, such as NLTK [32], NLP4J [33], Stanford CoreNLP [34], Apache OpenNLP and a very recent open source collection of Transformer-based models from Hugging Face Inc. [35], the spaCy library is optimised for speed on CPUs, has an intuitive API and integrates easily with the active learning-based annotation tool Prodigy [36]. The architecture of spaCy's NER model is based on convolutional neural networks, with tokens represented as hashed Bloom embeddings [37] of the prefix, suffix and lemma of individual words, augmented with a transition-based chunking model [38]. We also experimented with various combinations of hyperparameters of the neural network architecture, dropout rates, batch compounding, learning rate and regularisation schemes. We set aside 30 documents (10%) sampled at random from the training data as a validation set.
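The idea behind hashed Bloom embeddings can be sketched in a few lines: each feature string is hashed several times into a small table and the selected rows are summed, so an unbounded vocabulary shares a fixed number of rows. The hash function (MD5), the number of hashes, the table width and the helper names below are assumptions chosen for illustration and do not reproduce spaCy's internals.

```python
import hashlib
import random

def bloom_rows(key, n_rows, n_hashes=4):
    """Map a feature string to n_hashes row indices using seeded hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows

def bloom_embed(key, table, n_hashes=4):
    """Sum the table rows selected by the hashes of the key."""
    width = len(table[0])
    vector = [0.0] * width
    for row in bloom_rows(key, len(table), n_hashes):
        for i in range(width):
            vector[i] += table[row][i]
    return vector

rng = random.Random(0)
# 2000 rows of width 8: a toy table (real tables are much wider).
table = [[rng.uniform(-0.1, 0.1) for _ in range(8)] for _ in range(2000)]
vec = bloom_embed("suffix=min", table)
print(bloom_rows("suffix=min", len(table)), len(vec))
```

Two different features may collide on one row, but are unlikely to collide on all of them, which is what lets a compact table represent many distinct strings.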

3.2.4. Model training augmentation with bootstrapped noisy labels

Several recent lines of research have demonstrated a clear benefit, in terms of higher accuracy and better generalisation, of training neural networks with corrupted, noisy and synthetically augmented data [39, 40, 41, 42]. Training with data augmentation also alleviates the problem of learning from a limited amount of manually annotated data. Similar to the idea presented in 'Snorkel' [43], we designed a number of labelling functions (LFs) by compiling a list of rules and keyword patterns for all seven named-entity categories. Additionally, we exploited a 'sense2vec' approach [44], fine-tuned on the entire MIMIC-III corpus, to bootstrap keywords and patterns. 'sense2vec' is a more complex version of the 'word2vec' method [45] for representing words as vectors. The major improvement over 'word2vec' is that 'sense2vec' also learns from linguistic annotations of words, which allows sense disambiguation in their embeddings.

Footnote 3: https://spacy.io

The resulting labelling functions were used to create a 'silver' training set, consisting of data annotated by string pattern matching. The NER model was then trained using a combination of gold and silver annotated examples in each batch. In order to prevent data leakage and a biased inflation of the performance metrics, such as precision and recall, the model was tested only on the gold-annotated data set of 202 documents (cf. Table 1) provided by the n2c2 2018 challenge.
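A labelling function in this sense can be as simple as a pattern that emits a span and a label. The sketch below uses plain regular expressions as a stand-in for the rule lists; the patterns, the three drug names and the function name `silver_annotate` are hypothetical and far smaller than any realistic rule set.

```python
import re

# Hypothetical labelling functions: one compiled pattern per category.
LABELLING_FUNCTIONS = {
    "Drug": re.compile(r"\b(?:aspirin|metformin|warfarin)\b", re.I),
    "Strength": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.I),
    "Route": re.compile(r"\b(?:po|iv|oral|topical)\b", re.I),
    "Frequency": re.compile(r"\b(?:twice daily|daily|bid|tid|qid)\b", re.I),
    "Duration": re.compile(r"\bfor \d+ (?:days?|weeks?|months?)\b", re.I),
}

def silver_annotate(text):
    """Return sorted (start, end, label) spans produced by all labelling functions."""
    spans = []
    for label, pattern in LABELLING_FUNCTIONS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

example = "aspirin 81 mg po daily for 3 weeks"
for start, end, label in silver_annotate(example):
    print(label, repr(example[start:end]))
```

Such rules are noisy by construction (for instance, 'for 3 weeks' versus '3 weeks'), which is why silver spans are only mixed into training and never used for testing.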

3.2.5. Model evaluation

In order to estimate the performance of the proposed named-entity recognition model, we used the evaluation schema proposed in SemEval'13 and outlined in Appendix A. The evaluation schema comprised a number of categories of potential errors produced by the model, and the model performance metrics, such as precision and recall, were computed using the expressions in A.1. Under this evaluation schema, a partial match required the predicted label to match the gold-annotated label exactly, while no restriction was imposed on the token boundaries. The rationale behind this approach follows from the ambiguity in gold-annotated examples corresponding to the same concept. For example, both sequences 'for 3 weeks' and '3 weeks' were labelled as 'Duration'. In particular, 492 of 967 (71%) text spans labelled as 'Duration' started with the word 'for'.

We estimated both strict and lenient metrics. Strict metrics count only exact matches in both the surface strings and the corresponding labels, whereas lenient metrics also allow for partial matches. Specifically, strict and lenient metrics were obtained from A.1 with α=0 and α=1 respectively. We report both micro- and macro-averaged precision and recall and their corresponding F1 scores.
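The role of α can be illustrated with a short sketch. Expressions A.1 are not reproduced in this section, so the form below is an assumption based on the usual SemEval'13 counts (correct, incorrect, partial, missed, spurious), with partial matches credited with weight α; the counts themselves are hypothetical.

```python
def precision_recall_f1(correct, incorrect, partial, missed, spurious, alpha):
    """Assumed form: partial matches earn a credit of alpha (0 = strict, 1 = lenient)."""
    credit = correct + alpha * partial
    actual = correct + incorrect + partial + spurious    # spans the model produced
    possible = correct + incorrect + partial + missed    # spans in the gold standard
    precision = credit / actual
    recall = credit / possible
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for a single category.
counts = dict(correct=80, incorrect=5, partial=10, missed=5, spurious=5)
strict = precision_recall_f1(alpha=0, **counts)
lenient = precision_recall_f1(alpha=1, **counts)
print("strict ", [round(x, 3) for x in strict])
print("lenient", [round(x, 3) for x in lenient])
```

With these counts every partial match moves from an error to a hit when α goes from 0 to 1, so all three metrics rise from 0.8 to 0.9.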

4. Results

4.1. Model pre-training

The pre-training task was performed on the entire MIMIC-III data set for 350 epochs using a number of configurations of the width and depth of the convolutional layers. Each configuration was trained on a single RTX 2080 Ti GPU. The CNN dimensions, summary statistics of the pre-training text corpus, the average running time per epoch in minutes and the model size in MB are summarised in Table 2. The corresponding training losses, logarithmically scaled, are plotted in Fig. 1.

Width          Depth   Time (min)   Size (MB)
96 (default)       4           73         3.8
128                8           90        18.3
256                8          118        47.6
256               16          164        66.1

Number of documents   2,083,054
Number of words       3,129,334,419

Table 2: Model pre-training characteristics for various combinations of convolutional layer dimensions. Time per epoch and the resulting model size are reported in minutes and megabytes (MB) respectively.

Figure 1: The decaying loss of pre-trained models.

4.2. Rationale for collecting more training data

Generally, collecting more training data improves model accuracy and leads to better generalisation. Using the Prodigy library and its 'train-curve' recipe, we simulated the acquisition of more data by training the NER model on fractions (25%, 50%, 75% and 100%) of the training set and evaluating on the test set. We indeed observed (Table 3) a steady upward trend in accuracy as more training data were used. The gains diminished with each step, but accuracy was still increasing in the last segment of data (+0.21), which indicates the benefit of collecting further data.

Fraction   Accuracy   Delta
0%              0.0   baseline
25%           90.66   +90.66
50%           91.93   +1.27
75%           92.42   +0.49
100%          92.63   +0.21

Table 3: Change in accuracy with more training data. Delta denotes the improvement over the previous fraction.
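The Delta column of Table 3 can be recomputed from the accuracy column as the difference between consecutive fractions. The snippet below is only a small arithmetic check on the reported numbers; the variable names are illustrative.

```python
# Accuracy values as reported in Table 3, for 0%, 25%, 50%, 75% and 100% of the data.
fractions = [0, 25, 50, 75, 100]
accuracy = [0.0, 90.66, 91.93, 92.42, 92.63]

# Improvement relative to the previous fraction, rounded to two decimals.
deltas = [round(b - a, 2) for a, b in zip(accuracy, accuracy[1:])]
for fraction, delta in zip(fractions[1:], deltas):
    print(f"{fraction:>3}%  +{delta}")
```

A positive delta in the final step is the signal used here to argue for collecting more annotations, even though each step adds less than the one before.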

4.3. Named-entity recognition model

The developed Med7 clinical named-entity recognition model was trained on 1212 documents in total: 303 silver-annotated training examples, the 303 gold-annotated documents from the official n2c2 training data (cf. Table 1) and an additional 606 manually gold-annotated documents, randomly sampled from MIMIC-III discharge letters while ensuring that no documents from the test data were included. The manual annotation was performed using Prodigy, an active learning annotation tool, following the general procedure outlined in [46]. The baseline NER model for active-learning support, covering all seven categories, was trained on the official 303 documents. This baseline model was used within the Prodigy 'human-in-the-loop' framework to suggest entities on unseen texts, and a human annotator accepted or corrected the model predictions, creating gold-annotated examples. We obtained an inter-annotator agreement F1 score of 0.924 between the gold n2c2 annotations and those of our two annotators, and an F1 score of 0.989 between our two annotators. The explicit token-level confusion matrices, along with summary statistics, are presented in Table B.11, Table B.12 and Table B.13 respectively. To generate the silver training data, we used the spaCy Python library for keyword phrase matching with the 'EntityRuler' class, along with linguistic pattern matching with exemplars from the training data set. Drug names, both generic and brand names, were sourced from publicly available online resources. The composition of the training data, the token-level confusion matrix and the evaluation statistics are summarised in Table 4, Table 5 and Table 6 respectively.

                      Gold (n2c2)   Silver   Prodigy    Total
Dosage                       4227     2792      3437    10456
Drug                        16257    10551     12687    39495
Duration                      592      462       620     1674
Form                         6657     4299      5056    16012
Frequency                    6281     4317      5106    15704
Route                        5460     3761      4554    13775
Strength                     6694     4328      5246    16268
Number of documents           303      303       606     1212

Table 4: The distribution of annotated text spans in the three data sets used for training the NER model.

Rows: true categories. Columns: predicted categories, followed by missed and partial matches.

             Dosage  Drug  Duration  Form  Frequency  Route  Strength  Missed  Partial
Dosage         2225     0         6    10         24      1        16     200      199
Drug              2  9796         0     7          0      4         1     449      316
Duration          6     0       277     0          8      0         2      39       46
Form             38    31         0  3864          1     65         6      90      264
Frequency         1     3         4     5       3144      2         0     108      745
Route             3     4         0    43          1   3312         1     108       41
Strength         38     3         0     1          2      0      3304     650      232
Spurious         20   120         6     4          7     22         3

Table 5: Token-level confusion matrix of the predicted entities versus the ground-truth labels. Spurious examples correspond to a predicted entity boundary and type which do not exist in the ground-truth annotations, and partial matches correspond to a predicted entity boundary which overlaps with the gold annotation but is not identical to it. Missed entities correspond to ground-truth annotation boundaries which were not identified.

                           Strict                      Lenient
                   Precision  Recall     F1    Precision  Recall     F1
Dosage                 0.879   0.831  0.854        0.957   0.904  0.931
Drug                   0.954   0.926  0.941        0.984   0.956  0.971
Duration               0.817   0.733  0.773        0.953   0.854  0.901
Form                   0.921   0.886  0.903        0.983   0.947  0.965
Frequency              0.801   0.784  0.792        0.989   0.969  0.979
Route                  0.961   0.943  0.952        0.973   0.954  0.964
Strength               0.927   0.781  0.848        0.992   0.836  0.907
Average (micro)        0.916   0.871  0.893        0.982   0.933  0.957
Average (macro)        0.897   0.844  0.869        0.977   0.919  0.947

Table 6: The evaluation results of the NER model on the test set with 202 documents.
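The micro-averaged rows of Table 6 can be reproduced from the counts in Table 5 by pooling all categories before computing the scores. The sketch below assumes that partial matches are counted among both the predicted and the gold spans and are credited with weight α (0 for strict, 1 for lenient); this form is inferred from the tables rather than quoted from Appendix A, and the names used are illustrative.

```python
# Table 5: rows are true categories; the first seven columns are predicted
# categories, followed by the Missed and Partial counts.
MATRIX = [
    [2225, 0, 6, 10, 24, 1, 16, 200, 199],   # Dosage
    [2, 9796, 0, 7, 0, 4, 1, 449, 316],      # Drug
    [6, 0, 277, 0, 8, 0, 2, 39, 46],         # Duration
    [38, 31, 0, 3864, 1, 65, 6, 90, 264],    # Form
    [1, 3, 4, 5, 3144, 2, 0, 108, 745],      # Frequency
    [3, 4, 0, 43, 1, 3312, 1, 108, 41],      # Route
    [38, 3, 0, 1, 2, 0, 3304, 650, 232],     # Strength
]
SPURIOUS = [20, 120, 6, 4, 7, 22, 3]

def micro_scores(alpha):
    """Pool counts over all categories, then compute precision, recall and F1."""
    n = len(MATRIX)
    correct = sum(MATRIX[i][i] for i in range(n))
    incorrect = sum(MATRIX[i][j] for i in range(n) for j in range(n) if i != j)
    missed = sum(row[n] for row in MATRIX)
    partial = sum(row[n + 1] for row in MATRIX)
    credit = correct + alpha * partial
    precision = credit / (correct + incorrect + partial + sum(SPURIOUS))
    recall = credit / (correct + incorrect + partial + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(x, 3) for x in (precision, recall, f1))

print("strict ", micro_scores(alpha=0))
print("lenient", micro_scores(alpha=1))
```

Micro-averaging weights each category by its frequency, so the result is dominated by 'Drug', which accounts for roughly a third of all test annotations.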

4.4. Translation to UK-CRIS data

One of the challenges in developing a robust clinical information extraction system is its generalisability beyond the data distribution it was trained on. Accurate algorithms developed using data from a small number of medical centres have shown poor generalisability when applied within a similar context to other medical centres. For example, in a recent study on an algorithmic approach to the early detection of sepsis [47], the training data were sourced from the electronic health records of two hospitals, while data from a third hospital were used for testing the developed algorithm. It has been demonstrated and discussed in detail [48] that a highly accurate predictive algorithm, validated on a fraction of data from the same two hospitals, failed to achieve the same level of accuracy when tested on data from the third hospital, which was not included in the training process. Poor performance on out-of-distribution (OOD) data poses a significant challenge to wider application of the developed models, and is particularly important when algorithms inform real-world decisions [49].

Clinical concepts    Train   Test   Total
Dosage                 298     48     346
Drug                  3253    571    3824
Duration              1006    215    1221
Form                   410     63     473
Frequency             1604    305    1909
Route                  208     32     240
Strength              1338    276    1614
Number of texts        536    134     670

Table 7: Distribution of gold-annotated entities and text summary statistics of the OxCRIS training and test data sets. The number of unique tokens is computed by lowercasing words.

We investigated how accurate the developed Med7 model, trained on MIMIC-III electronic health records sourced from the Beth Israel Deaconess Medical Center in Massachusetts (United States), can be when applied to CRIS electronic health records in the United Kingdom. We selected a random sample of 670 documents from the Oxford Health NHS Foundation Trust (OHFT) instance of the UK-CRIS Network and asked a clinician to annotate them for the seven categories, following the official guidelines of the n2c2 challenge.

The token-level confusion matrix and the performance metrics of the Med7 model, trained on n2c2 data from MIMIC-III and applied to CRIS data from the Oxford instance, are presented in Table C.14 and Table 8 respectively. Comparing the performance before and after fine-tuning in Table 8 (F1=0.762 vs. F1=0.944), and against the in-domain result in Table 6 (F1=0.957), clearly shows the problem of direct transferability of NER models to a different data source.

                   Before fine-tuning on OxCRIS    After fine-tuning on OxCRIS
                   Precision  Recall     F1        Precision  Recall     F1
Dosage                 0.826   0.396  0.535            0.656   0.833  0.734
Drug                   0.912   0.968  0.939            0.975   0.977  0.976
Duration               0.951   0.107  0.192            0.883   0.934  0.908
Form                   0.554   0.611  0.581            0.924   0.968  0.946
Frequency              0.912   0.332  0.487            0.941   0.944  0.942
Route                  0.348   0.719  0.469            0.882   0.938  0.909
Strength               0.938   0.877  0.906            0.996   0.917  0.955
Average (micro)        0.864   0.681  0.762            0.941   0.947  0.944
Average (macro)        0.778   0.586  0.609            0.901   0.932  0.914

Table 8: The lenient evaluation results of the Med7 model using 134 test documents sourced from OxCRIS, the Oxford Health NHS Foundation Trust instance within the UK-CRIS electronic health records Network.
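The F1 values in Table 8 are the harmonic mean of the corresponding precision and recall, which makes it easy to see where the drop before fine-tuning comes from. The short check below uses four precision and recall pairs copied from Table 8; the dictionary layout and function name are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from Table 8.
table_8 = {
    ("micro", "before"): (0.864, 0.681),
    ("micro", "after"): (0.941, 0.947),
    ("Duration", "before"): (0.951, 0.107),
    ("Duration", "after"): (0.883, 0.934),
}
f1 = {key: round(f1_score(p, r), 3) for key, (p, r) in table_8.items()}
for key, value in f1.items():
    print(key, value)
```

For 'Duration' the precision before fine-tuning is high but the recall is very low, so the harmonic mean collapses: the model rarely labels a span incorrectly, it simply fails to find most of them.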

5. Discussion

The developed named-entity recognition model for clinical concepts in unstructured medical records was trained to recognise seven categories: drug names (both generic and brand names), dosage, strength, form, route of administration, duration of prescription and frequency. The data for model development and testing were sourced from the n2c2 challenge, comprising a collection of 303 and 202 documents for training and testing respectively, which represent a sample from the MIMIC-III electronic health records. We demonstrated (Section 4.2) that collecting more annotated examples would improve the model accuracy and therefore implemented two approaches for obtaining more annotations: noisy labelling and active learning with a 'human-in-the-loop'. For the noisy labelling, we created a list of unique patterns for each of the seven categories, sourced from the training corpus and from external resources available on the internet, and then used regular expressions with string pattern matching to assign labels to tokens. Our two annotators were trained by closely following the official 2018 n2c2 annotation guidelines and demonstrated a high level of inter-annotator agreement between themselves (F1=0.989) as well as a high level of concordance (F1=0.924) with the gold annotations provided by the organisers of the 2018 n2c2 Challenge (cf. Table B.13).

The overall (micro-averaged) performance of the NER model across all seven categories was F1=0.957 (0.893), with Precision=0.982 (0.916) and Recall=0.933 (0.871) for lenient (strict) estimates. A more detailed breakdown of the performance for each category is presented in Table 6. Under the strict metrics, the performance for the 'Duration' and 'Frequency' categories was poorer than for the other categories. There were intrinsically fewer mentions of 'Duration' (∼1.5%) in the texts, and these concepts were also ambiguously annotated, as mentioned in Section 3.2.5. A similar situation was observed for the 'Frequency' category where, in spite of a good number of annotated examples (∼14%), the ambiguity in the presentation of text spans was higher, which resulted in a large number of partial matches (cf. Table 5). Another reason for the poor performance on both 'Duration' and 'Frequency' was inconsistent annotation, where the same text string appeared in both categories.

Self-supervised pre-training of deep learning models has proven effective in many NLP tasks. We experimented with a number of architectural variations of the width and depth of the convolutional layers as well as the number of embedding rows. Empirically, and as confirmed by other studies [50], larger models with more parameters tend to achieve better results. Interestingly, the larger model (Width=256, Depth=16, Embeddings=10000) outperformed the default one (Width=96, Depth=4, Embeddings=2000) only by a small margin overall (F1_256=0.893 vs. F1_96=0.884); however, the differences were more visible for 'Duration' (F1_256=0.773 vs. F1_96=0.729) and 'Strength' (F1_256=0.848 vs. F1_96=0.801). The better performance came at the expense of training time, model size on disk and memory consumption. We publicly released the pre-trained neural network weights for various architectures through the dedicated GitHub repository (footnote 4).

Another objective of this work was to estimate the degree of transferability of the developed

information extraction model to another clinical domain. We evaluated how the Med7 model,

trained on a collection of discharge letters from the intensive care unit in the US (MIMIC),

performed on the secondary care mental health medical records in the UK (CRIS). The Med7

model was purposely designed to recognise non-context related medical concepts, such as drug

4https://github.com/kormilitzin/med7

names, strength, dosage, duration, route, form and frequency of administration and we expected

to see a comparable level of the model performance across the both EHR systems. To consis-

tently validate the transferability of the Med7 model, a random sample of 670 gold-annotated

examples from OxCRIS were split into training (536) and test (134) data sets (cf. Table 7). We

compared the performance of the Med7 model without and with ﬁne-tuning on OxCRIS. The

direct application of Med7 on the testing set of 134 documents, resulted in a quite poor perfor-

mance (F1=0.762). We investigated the cases where the model was predicting incorrectly and

in the majority of them, the main reason for poor performance was due to diﬀerences in the lan-

guage presentation of the concepts. For examples, the model largely missed concepts labelled

as ’Frequency’ in OxCRIS, such as ”ON”, (”every night”), ”OD” (”every day”), ”BD” (”twice

daily”), ”OM” (”every morning”), ”mane” and ”nocte”. Then, we ﬁne-tuned the Med7 model

on the training set of OxCRIS (536 documents) and evaluated on the same testing set as before

of 134 documents. Despite the small number of training examples in OxCRIS, leveraging the

transfer learning approach of re-using the pre-trained Med7 model on MIMIC, resulted in higher

accuracy (F1=0.944) comparable with training and testing on the same domain (cf. Table 8).
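The random split of the 670 OxCRIS examples into 536 training and 134 test documents can be sketched as follows. This is a minimal illustration, assuming the documents are held in a Python list; the function name `split_documents`, the fixed seed and the placeholder document identifiers are hypothetical and are not taken from the paper.

```python
import random


def split_documents(documents, n_train, seed=0):
    """Randomly split documents into a training list and a test list."""
    if not 0 <= n_train <= len(documents):
        raise ValueError("n_train must be between 0 and len(documents)")
    shuffled = list(documents)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 670 gold-annotated examples.
docs = [f"doc_{i:03d}" for i in range(670)]
train_docs, test_docs = split_documents(docs, n_train=536)
print(len(train_docs), len(test_docs))
```

Keeping the test list fixed across both experiments (without and with fine-tuning) is what makes the two F1 scores directly comparable.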

One strength of this project is the interoperability of the developed model with other generic deep learning NLP libraries, such as HuggingFace and Thinc, as well as its straightforward integration with pipelines developed under the spaCy framework. This allows users to customise the Med7 model and include other pipeline components, such as negation detection and entity relation extraction, and to map the extracted concepts onto the Unified Medical Language System (UMLS). Normalisation of concepts to UMLS categories will make it possible to systematically parse electronic medical records into a structured and consistent tabular form, ready for downstream epidemiological analyses. Additionally, the developed model integrates naturally into the Prodigy annotation tool, which makes it possible to collect more gold-annotated examples efficiently. It is also worth mentioning that the Med7 model is designed to run on standard CPUs rather than expensive GPUs. This will allow researchers without access to expensive and complex infrastructure to develop fast and robust pipelines for clinical natural language processing.
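To illustrate what a structured and consistent tabular form could look like, the sketch below collects entity spans into one record with a column per Med7 category. It is a minimal illustration only: the `(text, label)` input format, the helper name `entities_to_row` and the example prescription are assumptions, and no NER model is actually run here.

```python
# The seven categories recognised by Med7, as listed in the paper.
CATEGORIES = ("Drug", "Strength", "Dosage", "Duration", "Route", "Form", "Frequency")


def entities_to_row(entities):
    """Group (text, label) pairs into a dict with one list per category."""
    row = {category: [] for category in CATEGORIES}
    for text, label in entities:
        if label not in row:
            raise ValueError(f"unknown category: {label!r}")
        row[label].append(text)
    return row


# Hypothetical NER output for: "aspirin 75 mg tablet PO OD for 3 weeks"
entities = [
    ("aspirin", "Drug"),
    ("75 mg", "Strength"),
    ("tablet", "Form"),
    ("PO", "Route"),
    ("OD", "Frequency"),
    ("for 3 weeks", "Duration"),
]
row = entities_to_row(entities)
print(row["Drug"], row["Frequency"], row["Dosage"])
```

Categories with no mention in the text (here 'Dosage') stay as empty lists, so every record has the same seven columns.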

However, two limitations should be noted. First, some of the categories are naturally underrepresented, which affects the accuracy of the NER model. It was observed empirically that the distribution of annotated entities in the medical records is intrinsically skewed: 'Duration' entities occur much less often than drug names and strengths, making it more challenging to train a robust model to identify them accurately. Interestingly, the same pattern in the number of reported mentions of the 'Duration' category persists in both MIMIC and OxCRIS data, which might be indicative of a general clinical reporting pattern. A second limitation of this study is the low number of manually annotated examples in OxCRIS, which precluded a more rigorous evaluation of the transferability of the Med7 model across all seven categories.

Future research into a robust clinical information extraction system will need to further address the feasibility of deploying the model across the member Trusts of the UK-CRIS Network and to evaluate its transferability. The aim is to furnish clinical researchers with an open-source and robust tool for structuring free-text patients' data for downstream analytical tasks.

6. Conclusion

In this work we developed and validated a clinical named-entity recognition model for free-text electronic health records. The model was developed using the MIMIC-III free-text data and trained on a combination of manually annotated data from the 2018 n2c2 challenge, a random sample from MIMIC-III with noisy labels, and data manually annotated using active learning with Prodigy. To make the most of the large amount of unstructured free-text data and alleviate the problem of training from limited data, we used self-supervised learning to pre-train the weights of the NER neural network model. We demonstrated that transfer learning plays an essential role in developing a robust model applicable across different clinical domains, and that the developed Med7 model does not require expensive infrastructure and can be used on standard CPU-only machines. Further research is needed to improve the recognition of naturally underrepresented concepts; we plan to address this problem, as well as the normalisation of extracted concepts and UMLS linkage, in future releases of the Med7 model.

Acknowledgments

The study was funded by the National Institute for Health Research's (NIHR) Oxford Health Biomedical Research Centre (BRC-1215-20005). This work was supported by the UK Clinical Records Interactive Search (UK-CRIS) system funded and developed by the NIHR Oxford Health BRC at Oxford Health NHS Foundation Trust and the Department of Psychiatry, University of Oxford. AK, NV, QL, ANH are funded by the MRC Pathfinder Grant (MC-PC-17215). We are thankful to the organisers of the n2c2 2018 Challenge for providing the annotated corpus and the annotation guidelines.

The views expressed are those of the authors and not necessarily those of the UK National Health Service, the NIHR, or the UK Department of Health.

We would also like to acknowledge the work and support of the Oxford CRIS Team: Tanya Smith, Head of Research Informatics and Biomedical Research Centre (BRC) CRIS Theme Lead, and Lulu Kane, Adam Pill and Suzanne Fisher, CRIS Academic Support and Information Analysts.

Appendix A. The evaluation schema for extracted concepts

In order to evaluate the output of the NER system, we adopted the notation developed for different categories of errors [51] and the evaluation schema introduced in SemEval'13 (cf. Eq. A.1). The following types of evaluation errors were considered (Table A.9):

| # | Error type | Gold standard: text span | Gold standard: label | NER prediction: text span | NER prediction: label |
|---|---|---|---|---|---|
| 1 | Correct (COR) | aspirin | Drug | aspirin | Drug |
| 2 | Incorrect (INC) | 25 | Strength | 25 | Dosage |
| 3 | Partial (PAR) | Augmentin | Drug | Augmentin XR | Drug |
| 4 | Partial (PAR) | for 3 weeks | Duration | 3 weeks | Duration |
| 5 | Partial (PAR) | p.r.n. | Frequency | prn | Frequency |
| 6 | Missing (MIS) | tablet | Form | - | - |
| 7 | Spurious (SPU) | - | - | Codeine | Drug |

Table A.9: A list of examples of typical errors produced by the NER model.

Here, Correct (COR) represents a complete match of both the annotation boundary and the entity type. Incorrect (INC) is the case where the predicted boundary, the entity type, or both do not match the gold annotation. A Partial (PAR) match corresponds to a predicted entity boundary that overlaps with the ground-truth annotation without being exactly the same. Missing (MIS) is the case where a ground-truth annotated boundary present in the gold-annotated corpus is not predicted by the NER system. Spurious (SPU) corresponds to a predicted entity boundary that does not exist in the gold-annotated corpus.

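$$
\begin{aligned}
\text{Possible (POS)} &= \text{COR} + \text{INC} + \text{PAR} + \text{MIS} = \text{TP} + \text{FN} \\
\text{Actual (ACT)} &= \text{COR} + \text{INC} + \text{PAR} + \text{SPU} = \text{TP} + \text{FP} \\
\text{Precision} &= \frac{\text{COR} + \alpha\,\text{PAR}}{\text{ACT}} \\
\text{Recall} &= \frac{\text{COR} + \alpha\,\text{PAR}}{\text{POS}}
\end{aligned}
\tag{A.1}
$$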
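Eq. A.1 can be turned into a few lines of code. The sketch below is illustrative only: the class and function names are hypothetical, the F1 score is computed as the usual harmonic mean of precision and recall, and the two values of α used in the example (0.5 and 1.0) are assumptions chosen to show the effect of the partial-match weight rather than values prescribed by this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCounts:
    cor: int  # correct
    inc: int  # incorrect
    par: int  # partial
    mis: int  # missing
    spu: int  # spurious


def precision_recall_f1(c, alpha=0.5):
    """Precision, recall and F1 following Eq. A.1; alpha weights partial matches."""
    possible = c.cor + c.inc + c.par + c.mis  # POS = TP + FN
    actual = c.cor + c.inc + c.par + c.spu    # ACT = TP + FP
    matched = c.cor + alpha * c.par
    precision = matched / actual if actual else 0.0
    recall = matched / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the seven example rows of Table A.9:
# 1 correct, 1 incorrect, 3 partial, 1 missing, 1 spurious.
counts = ErrorCounts(cor=1, inc=1, par=3, mis=1, spu=1)
half_credit = precision_recall_f1(counts, alpha=0.5)
full_credit = precision_recall_f1(counts, alpha=1.0)
print(half_credit, full_credit)
```

Raising α increases the credit given to partial matches, so both precision and recall grow with it while the COR, INC, MIS and SPU counts stay the same.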

Appendix B. Inter-annotator agreement analysis

We estimated the level of concordance between the gold-annotated corpus from the n2c2 2018 challenge and two trained annotators. The annotators closely followed the same annotation guidelines as used in the challenge. Ten documents were sampled at random from the 202 documents comprising the test set. The number of annotated entities of each type in the gold standard and from each of the two annotators is presented in Table B.10.

| Type of annotated entity | Gold (n2c2) | Annotator 1 | Annotator 2 |
|---|---|---|---|
| Dosage | 128 | 139 | 139 |
| Drug | 519 | 530 | 526 |
| Duration | 28 | 31 | 32 |
| Form | 234 | 246 | 238 |
| Frequency | 193 | 196 | 201 |
| Route | 179 | 167 | 167 |
| Strength | 200 | 212 | 205 |
| Number of documents | 10 | 10 | 10 |

Table B.10: The number of the gold and manually annotated entities for the inter-annotator agreement evaluation corpus, comprising ten randomly sampled texts from the test set of 202 documents.

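| Gold (n2c2) ↓ / Annotator 1 → | Dosage | Drug | Duration | Form | Frequency | Route | Strength | Missed | Partial |
|---|---|---|---|---|---|---|---|---|---|
| Dosage | 104 | 0 | 1 | 3 | 0 | 0 | 2 | 17 | 4 |
| Drug | 0 | 473 | 0 | 3 | 0 | 1 | 0 | 27 | 21 |
| Duration | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 2 | 7 |
| Form | 1 | 4 | 0 | 201 | 0 | 2 | 0 | 7 | 21 |
| Frequency | 1 | 0 | 0 | 0 | 172 | 0 | 1 | 2 | 17 |
| Route | 2 | 2 | 0 | 2 | 0 | 156 | 0 | 15 | 2 |
| Strength | 2 | 1 | 0 | 0 | 0 | 0 | 171 | 4 | 28 |
| Spurious | 25 | 29 | 4 | 16 | 7 | 6 | 10 | | |

Table B.11: Token-level confusion matrix of annotator 1 versus the gold-standard annotations provided by the 2018 n2c2 challenge.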

We examined the cases where our two annotators labelled the concepts of interest differently from the gold-annotated data set provided by the n2c2 team.

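| Gold (n2c2) ↓ / Annotator 2 → | Dosage | Drug | Duration | Form | Frequency | Route | Strength | Missed | Partial |
|---|---|---|---|---|---|---|---|---|---|
| Dosage | 104 | 0 | 1 | 3 | 0 | 0 | 2 | 17 | 4 |
| Drug | 0 | 472 | 0 | 3 | 0 | 1 | 0 | 30 | 20 |
| Duration | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 2 | 7 |
| Form | 0 | 3 | 0 | 201 | 0 | 2 | 0 | 9 | 21 |
| Frequency | 0 | 0 | 1 | 0 | 172 | 0 | 0 | 2 | 18 |
| Route | 2 | 2 | 0 | 2 | 0 | 156 | 0 | 15 | 2 |
| Strength | 3 | 1 | 0 | 0 | 4 | 0 | 171 | 3 | 21 |
| Spurious | 26 | 28 | 4 | 8 | 7 | 6 | 10 | | |

Table B.12: Token-level confusion matrix of annotator 2 versus the gold-standard annotations provided by the 2018 n2c2 challenge.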

| Entity | A1 vs. Gold: Pr | A1 vs. Gold: Re | A1 vs. Gold: F1 | A2 vs. Gold: Pr | A2 vs. Gold: Re | A2 vs. Gold: F1 | A1 vs. A2: Pr | A1 vs. A2: Re | A1 vs. A2: F1 |
|---|---|---|---|---|---|---|---|---|---|
| Dosage | 0.777 | 0.824 | 0.801 | 0.777 | 0.824 | 0.801 | 0.986 | 0.986 | 0.986 |
| Drug | 0.935 | 0.935 | 0.935 | 0.935 | 0.935 | 0.935 | 0.998 | 0.991 | 0.994 |
| Duration | 0.812 | 0.923 | 0.867 | 0.812 | 0.929 | 0.867 | 0.969 | 1.000 | 0.984 |
| Form | 0.933 | 0.941 | 0.937 | 0.933 | 0.941 | 0.937 | 1.000 | 0.967 | 0.983 |
| Frequency | 0.945 | 0.984 | 0.964 | 0.945 | 0.984 | 0.964 | 0.975 | 1.000 | 0.987 |
| Route | 0.946 | 0.883 | 0.913 | 0.946 | 0.883 | 0.913 | 1.000 | 1.000 | 1.000 |
| Strength | 0.941 | 0.946 | 0.944 | 0.941 | 0.946 | 0.944 | 1.000 | 0.962 | 0.981 |
| Average (micro) | 0.921 | 0.928 | 0.924 | 0.921 | 0.928 | 0.924 | 0.994 | 0.985 | 0.989 |
| Average (macro) | 0.901 | 0.921 | 0.911 | 0.901 | 0.921 | 0.911 | 0.991 | 0.986 | 0.988 |

Table B.13: The evaluation results of the inter-annotator agreement on a random selection of ten documents from the 202 test texts. A pair-wise comparison between each of the annotators (A1, A2) and the gold-annotated documents as well as the direct comparison between both annotators. Pr: precision, Re: recall.
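Micro and macro averages differ in how the categories are weighted: a micro average pools the counts of all categories before dividing, while a macro average takes the unweighted mean of the per-category scores. The sketch below contrasts the two for precision. The true-positive and false-positive counts are made up for illustration and are not derived from the tables in this appendix; the helper names are hypothetical.

```python
def micro_precision(counts):
    """Pool true and false positives over all categories, then divide once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    return tp / (tp + fp) if tp + fp else 0.0


def macro_precision(counts):
    """Compute precision per category, then take the unweighted mean."""
    per_category = [
        c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        for c in counts.values()
    ]
    return sum(per_category) / len(per_category) if per_category else 0.0


# Hypothetical counts: one frequent category and one rare category.
counts = {
    "Drug": {"tp": 90, "fp": 10},
    "Duration": {"tp": 5, "fp": 5},
}
micro = micro_precision(counts)
macro = macro_precision(counts)
print(round(micro, 3), round(macro, 3))
```

With these counts the rare category pulls the macro average down to 0.7, while the micro average (about 0.864) stays close to the precision of the frequent category.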

Appendix C. Fine-tuning on UK-CRIS

Med7-predicted categories: before fine-tuning on OxCRIS

| Gold annotated ↓ / Med7 predicted → | Dosage | Drug | Duration | Form | Frequency | Route | Strength | Missed | Partial |
|---|---|---|---|---|---|---|---|---|---|
| Dosage | 18 | 0 | 0 | 0 | 0 | 0 | 12 | 17 | 1 |
| Drug | 0 | 535 | 0 | 0 | 0 | 0 | 0 | 18 | 15 |
| Duration | 0 | 0 | 18 | 0 | 1 | 0 | 0 | 158 | 1 |
| Form | 0 | 2 | 0 | 34 | 0 | 1 | 0 | 20 | 2 |
| Frequency | 0 | 7 | 0 | 25 | 86 | 40 | 1 | 114 | 7 |
| Route | 0 | 0 | 0 | 3 | 3 | 23 | 0 | 6 | 0 |
| Strength | 3 | 0 | 0 | 0 | 0 | 0 | 238 | 31 | 4 |
| Spurious | 1 | 44 | 1 | 1 | 8 | 2 | 3 | | |

Table C.14: Token-level confusion matrix of the Med7 model trained on MIMIC-III and applied to 134 manually annotated documents from the Oxford instance (OxCRIS) of the UK-CRIS electronic medical records Network.

Med7-predicted categories: after fine-tuning on OxCRIS

| Gold annotated ↓ / Med7 predicted → | Dosage | Drug | Duration | Form | Frequency | Route | Strength | Missed | Partial |
|---|---|---|---|---|---|---|---|---|---|
| Dosage | 39 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 1 |
| Drug | 0 | 553 | 0 | 2 | 0 | 0 | 0 | 11 | 4 |
| Duration | 0 | 0 | 177 | 0 | 1 | 0 | 0 | 13 | 20 |
| Form | 0 | 0 | 0 | 61 | 1 | 1 | 0 | 0 | 0 |
| Frequency | 1 | 1 | 0 | 2 | 279 | 1 | 0 | 12 | 6 |
| Route | 0 | 0 | 0 | 0 | 0 | 30 | 0 | 2 | 0 |
| Strength | 16 | 1 | 0 | 0 | 0 | 0 | 242 | 6 | 11 |
| Spurious | 4 | 12 | 26 | 1 | 16 | 2 | 0 | | |

Table C.15: Token-level confusion matrix of the Med7 model trained on MIMIC-III, fine-tuned on OxCRIS and applied to 134 manually annotated documents from the Oxford instance (OxCRIS) of the UK-CRIS electronic medical records Network.
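As a rough reading aid for Tables C.14 and C.15, the share of gold tokens that received the correct label can be computed row by row, as the diagonal entry divided by the row total (including the Missed and Partial columns). This is a simple illustrative measure chosen here, not the F1 score reported in the paper. The function name `row_share_correct` is hypothetical, and only the 'Frequency' and 'Duration' rows of the two tables are copied in.

```python
COLUMNS = ("Dosage", "Drug", "Duration", "Form", "Frequency",
           "Route", "Strength", "Missed", "Partial")


def row_share_correct(label, row):
    """Fraction of a gold row's tokens that fall on the diagonal."""
    total = sum(row)
    return row[COLUMNS.index(label)] / total if total else 0.0


# Rows copied from Table C.14 (before fine-tuning on OxCRIS).
before = {
    "Frequency": (0, 7, 0, 25, 86, 40, 1, 114, 7),
    "Duration": (0, 0, 18, 0, 1, 0, 0, 158, 1),
}
# Rows copied from Table C.15 (after fine-tuning on OxCRIS).
after = {
    "Frequency": (1, 1, 0, 2, 279, 1, 0, 12, 6),
    "Duration": (0, 0, 177, 0, 1, 0, 0, 13, 20),
}

shares = {
    label: (round(row_share_correct(label, before[label]), 3),
            round(row_share_correct(label, after[label]), 3))
    for label in before
}
print(shares)
```

For 'Frequency' this gives roughly 0.307 before and 0.924 after fine-tuning, and for 'Duration' roughly 0.101 before and 0.839 after, which reflects how many tokens of these two categories were missed when the model was applied directly.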

References

[1] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word

representations, arXiv preprint arXiv:1802.05365 (2018).

[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all

you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.

[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language

understanding, arXiv preprint arXiv:1810.04805 (2018).

[4] S. Velupillai, H. Suominen, M. Liakata, A. Roberts, A. D. Shah, K. Morley, D. Osborn, J. Hayes, R. Stewart,

J. Downs, et al., Using clinical natural language processing for health outcomes research: overview and actionable

suggestions for future advances, Journal of biomedical informatics 88 (2018) 11–19.

[5] H. Schütze, C. D. Manning, P. Raghavan, Introduction to information retrieval, in: Proceedings of the international communication of association for computing machinery conference, Vol. 4, 2008.

[6] H. Dalianis, Clinical text mining: Secondary use of electronic patient records, Springer, 2018.

[7] S. Wu, K. Roberts, S. Datta, J. Du, Z. Ji, Y. Si, S. Soni, Q. Wang, Q. Wei, Y. Xiang, et al., Deep learning in clinical

natural language processing: a methodical review, Journal of the American Medical Informatics Association 27 (3)

(2020) 457–470.

[8] O. Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic acids research 32 (suppl 1) (2004) D267–D270.

[9] H. Gurulingappa, A. M. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, L. Toldo, Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports, Journal of biomedical informatics 45 (5) (2012) 885–892.

[10] G. Zhou, J. Su, Named entity recognition using an hmm-based chunk tagger, in: proceedings of the 40th Annual

Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 473–

480.

[11] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and

their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.

[12] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014

conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[13] K. KS, S. Sangeetha, Secnlp: A survey of embeddings in clinical natural language processing, arXiv preprint

arXiv:1903.01039 (2019).

[14] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146 (2018).

[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A

robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).

[16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining

for language understanding, in: Advances in neural information processing systems, 2019, pp. 5754–5764.

[17] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical language represen-

tation model for biomedical text mining, Bioinformatics 36 (4) (2020) 1234–1240.

[18] K. Huang, J. Altosaar, R. Ranganath, Clinicalbert: Modeling clinical notes and predicting hospital readmission,

arXiv preprint arXiv:1904.05342 (2019).

[19] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical

bert embeddings, arXiv preprint arXiv:1904.03323 (2019).

[20] M. Neumann, D. King, I. Beltagy, W. Ammar, Scispacy: Fast and robust models for biomedical natural language

processing, arXiv preprint arXiv:1902.07669 (2019).

[21] A. E. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, MIMIC-III, a freely accessible critical care database, Scientific data 3 (2016) 160035.

[22] S. Henry, K. Buchan, M. Filannino, A. Stubbs, O. Uzuner, 2018 n2c2 shared task on adverse drug events and

medication extraction in electronic health records, Journal of the American Medical Informatics Association 27 (1)

(2019) 3–12.

[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database,

in: CVPR09, 2009.

[24] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context, CoRR abs/1405.0312 (2014). arXiv:1405.0312. URL http://arxiv.org/abs/1405.0312

[25] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, R. Weischedel, Ontonotes: The 90% solution, in: Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, Association for Computational Linguistics, USA, 2006, pp. 57–60.

[26] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ Questions for Machine Comprehension of Text, arXiv e-prints (2016) arXiv:1606.05250.

[27] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis,

in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language

Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 142–150.

URL http://www.aclweb.org/anthology/P11-1015

[28] M. Hofer, A. Kormilitzin, P. Goldberg, A. Nevado-Holgado, Few-shot learning for named entity recognition in

medical text, arXiv preprint arXiv:1811.05468 (2018).

[29] L. Gligic, A. Kormilitzin, P. Goldberg, A. Nevado-Holgado, Named entity recognition in electronic health records

using transfer learning bootstrapped neural networks, Neural Networks 121 (2020) 132–139.

[30] Y. Wang, S. Sohn, S. Liu, F. Shen, L. Wang, E. J. Atkinson, S. Amin, H. Liu, A clinical text classification paradigm using weak supervision and deep representation, BMC medical informatics and decision making 19 (1) (2019) 1.

[31] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural

networks and incremental parsing, to appear (2017).

[32] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly Media, 2009.

[33] J. D. Choi, Dynamic feature induction: The last gist to the state-of-the-art, in: Proceedings of the 2016 Conference

of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,

2016, pp. 271–281.

[34] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The stanford corenlp natural lan-

guage processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics:

system demonstrations, 2014, pp. 55–60.

[35] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al.,

Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).

[36] I. Montani, M. Honnibal, Prodigy: A new annotation tool for radically efficient machine teaching, Artificial Intelligence to appear (2018).

[37] J. Serrà, A. Karatzoglou, Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks, in: Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 279–287.

[38] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recog-

nition, arXiv preprint arXiv:1603.01360 (2016).

[39] Q. Xie, E. Hovy, M.-T. Luong, Q. V. Le, Self-training with noisy student improves imagenet classification, arXiv preprint arXiv:1911.04252 (2019).

[40] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural infor-

mation processing systems, 2013, pp. 1196–1204.

[41] I. Provilkov, D. Emelianenko, E. Voita, Bpe-dropout: Simple and effective subword regularization, arXiv preprint arXiv:1910.13267 (2019).

[42] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A. Kantor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling, Not enough

data? deep learning to the rescue!, arXiv preprint arXiv:1911.03118 (2019).

[43] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, The VLDB Journal (2019) 1–22.

[44] A. Trask, P. Michalak, J. Liu, sense2vec-a fast and accurate method for word sense disambiguation in neural word

embeddings, arXiv preprint arXiv:1511.06388 (2015).

[45] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).

[46] N. Vaci, Q. Liu, A. Kormilitzin, F. De Crescenzo, A. Kurtulmus, J. Harvey, B. O’Dell, S. Innocent, A. Tomlin-

son, A. Cipriani, et al., Natural language processing for structuring clinical text data on depression using uk-cris,

Evidence-Based Mental Health 23 (1) (2020) 21–26.

[47] M. A. Reyna, C. S. Josef, R. Jeter, S. P. Shashikumar, M. B. Westover, S. Nemati, G. D. Clifford, A. Sharma, Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019, Critical Care Medicine 48 (2) (2020) 210.

[48] J. Morrill, A. Kormilitzin, A. Nevado-Holgado, S. Swaminathan, S. Howison, T. Lyons, The signature-based model

for early detection of sepsis from electronic health records in the intensive care unit, in: 2019 Computing in

Cardiology Conference (CinC). IEEE, 2019.

[49] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, B. Lakshminarayanan, Likelihood ratios for

out-of-distribution detection, in: Advances in Neural Information Processing Systems, 2019, pp. 14680–14691.

[50] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).

[51] N. Chinchor, B. Sundheim, MUC-5 evaluation metrics, in: Fifth Message Understanding Conference (MUC-5):

Proceedings of a Conference Held in Baltimore, Maryland, August 25-27, 1993, 1993.

URL https://www.aclweb.org/anthology/M93-1007


Neural networks (NNs) have become the state of the art in many machine learning applications, such as image, sound (LeCun et al., 2015) and natural language processing (Young et al., 2017; Linggard et al., 2012). However, the success of NNs remains dependent on the availability of large labelled datasets, such as in the case of electronic health records (EHRs). With scarce data, NNs are unlikely to be able to extract this hidden information with practical accuracy. In this study, we develop an approach that solves these problems for named entity recognition, obtaining an F1 score of 94.6 in the I2B2 2009 Medical Extraction Challenge (Uzuner et al., 2010), 4.3 points above the architecture that won the competition. To achieve this, we bootstrap our NN models through transfer learning by pretraining word embeddings on a secondary task performed on a large pool of unannotated EHRs and using the output embeddings as a foundation of a range of NN architectures. Beyond the official I2B2 challenge, we further achieve an F1 score of 82.4 on extracting relationships between medical terms using attention-based seq2seq models bootstrapped in the same manner.
