Removing confounding information from fetal ultrasound images*
Kamil Mikolaj1,2, Manxi Lin1, Zahra Bashir2, Morten Bo Søndergaard
Svendsen2, Martin Tolsgaard2, Anders Nymark1, and Aasa Feragen1
1DTU Compute, Technical University of Denmark
2CAMES Rigshospitalet
{kmik, afhar}@dtu.dk
Abstract. Confounding information in the form of text or markings embedded in medical images can severely affect the training of diagnostic deep learning algorithms. However, data collected for clinical purposes often have such markings embedded in them. In dermatology, known examples include drawings or rulers that are overrepresented in images of malignant lesions. In this paper, we encounter text and calipers placed on the images found in national databases containing fetal screening ultrasound scans, which correlate with standard planes to be predicted. In order to utilize the vast amounts of data available in these databases, we develop and validate a series of methods for minimizing the confounding effects of embedded text and calipers on deep learning algorithms designed for ultrasound, using standard plane classification as a test case.
Keywords: Removing confounding information · Model Bias · Fetal Ultrasound · Standard Plane Classification.
1 Introduction
Clinical data is a potentially rich source of training data for medical imaging models: It comes in vast amounts and reflects the data quality encountered in clinical practice. However, when data comes from the clinic, there is little control over the data-generating process, and the data may be affected in ways that are suboptimal for training deep learning models. In particular, clinical images sometimes come with embedded text, markings, calipers or other annotations made by the clinician, and these likely carry information correlating with the predictive task at hand. It has recently been shown that markings, stickers and rulers present in dermatological images can confound predictors that aim to diagnose skin lesions [6,12,7]. In this paper, we consider confounding information present in fetal ultrasound images from clinical screening. As shown in Fig. 1, these images often have text and calipers embedded in them, which can affect predictors trained on the images. As a case study, we use standard plane classification, which aims to automatically recognize the ultrasound planes required for particular types of screening tests during pregnancy.
* Supported by organization x.
arXiv:2303.13918v1 [cs.CV] 24 Mar 2023
Fig. 1. Examples of clinical ultrasound images with text and calipers, i.e. annotated coordinates for measuring anatomical objects, embedded into the image. Note that both the text labels and the caliper geometry carry information about the particular standard plane that the image contains.
Our contribution First, we quantify the confounding effect of text and calipers embedded in fetal ultrasound images used for training neural networks. Next, we quantitatively assess the success of different methods that aim to remove these confounding effects, ranging from naïve to state-of-the-art text inpainting methods developed for natural images. We show that even simple methods that mask out the confounding information ensure improved generalization to images that do not contain confounding information.
1.1 Related work
Recent work in dermatological imaging pointed out that pen markings or rulers, which are often present in images of malignant skin lesions from clinical practice, were actually confounding skin lesion diagnosis performed by a CNN approved as a medical device [12,6]. More precisely, it was shown that the predictive models performed better on images with rulers and markings than on those without. This discovery spurred research into methods for removing this confounding effect, such as segmenting the lesion as a preprocessing step [5] to avoid looking at context; inpainting stickers present on benign images to remove them from the training images [7]; inserting prior knowledge into the models [1,8]; or adversarially training the neural network to be unable to recognize whether the confounding information is present [2].
In our setting, the confounding information is typically embedded into the clinically relevant part of the image. As a result, we cannot apply methods that remove context by segmenting out the object of interest. Instead, we focus on validating both simple and more complex models for removing and inpainting confounding text and calipers.
2 Methodology
We assess the confounding effects of embedded text and calipers using standard plane classification for 3rd trimester growth scans as a test case. In order to estimate the weight of the fetus, as is commonly done in the 3rd trimester, the clinician needs to obtain standard planes for the head, abdomen and femur. As a typical application is to distinguish good standard planes from nonstandard planes that cannot be used for standardized measurements, we also include images that are not standard planes, which should ideally be classified as "Other".
We assess six different methods for removing confounding information. Ex-
amples of images inpainted with the different methods are found in Fig. 2.
2.1 Simple methods for removing confounding effects
The initial four methods consist of first detecting the text and calipers and then replacing them in various ways.
The text and calipers embedded in the clinically relevant part of the image are (in our training set) always yellow; they are detected via thresholding in hue, saturation, value (HSV) space to segment yellow features. The resulting mask is dilated with a 3x3 structuring element to enlarge the segmentation and connect neighboring elements.
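As an illustration, this detection step can be written with a few OpenCV calls. The HSV bounds and the function name below are assumptions for the sketch, not values taken from the paper.

```python
import cv2
import numpy as np

def detect_yellow_overlay(image_bgr,
                          lower=(20, 80, 80),     # assumed lower HSV bound for "yellow"
                          upper=(35, 255, 255)):  # assumed upper HSV bound
    """Binary mask of yellow text/calipers, dilated with a 3x3 structuring element."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
    kernel = np.ones((3, 3), np.uint8)
    # Dilation enlarges the segmentation and connects neighbouring elements.
    return cv2.dilate(mask, kernel, iterations=1)
```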
Additionally, as can be seen in Fig. 1, most images contain some gray and blue text in the top right corner. To remove this, we first remove everything around the conical ultrasound field of view. A mask is obtained by thresholding the cone in HSV space, finding the largest connected component and filling its holes, after which everything else can be masked out. For cases where some blue and gray text sits on top of the field of view, the blue text is detected, and everything above it is replaced by a black box.
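A possible implementation of the field-of-view mask is sketched below; the value threshold, the flood-fill-based hole filling, and the assumption that the image corner is background are ours, not taken from the paper.

```python
import cv2
import numpy as np

def field_of_view_mask(image_bgr, value_thresh=10):
    """Mask of the conical ultrasound field of view (largest bright region, holes filled)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    bright = cv2.inRange(hsv, (0, 0, value_thresh), (180, 255, 255))  # non-black pixels
    # The largest connected component is assumed to be the cone.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(bright, connectivity=8)
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    cone = np.uint8(labels == largest) * 255
    # Fill holes: flood-fill the background from the (assumed black) corner and invert.
    flood = cone.copy()
    h, w = cone.shape
    cv2.floodFill(flood, np.zeros((h + 2, w + 2), np.uint8), (0, 0), 255)
    return cone | cv2.bitwise_not(flood)

# Pixels outside the returned mask (corner annotations, scanner text) are then set to zero.
```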
We next consider various approaches to inpainting the yellow masks.
Black box. In this first simple approach, the detected yellow mask is overlaid by black boxes spanned by the minimum and maximum x- and y-coordinates found within every connected component of the mask.
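A sketch of this baseline, using connected-component statistics to obtain the bounding boxes (the function name is illustrative):

```python
import cv2

def blackbox_inpaint(image_bgr, overlay_mask):
    """Cover each connected component of the mask with its axis-aligned bounding box."""
    out = image_bgr.copy()
    n, _, stats, _ = cv2.connectedComponentsWithStats(overlay_mask, connectivity=8)
    for i in range(1, n):  # label 0 is the background
        x, y = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
        w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
        out[y:y + h, x:x + w] = 0  # black box over the component
    return out
```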
Replacing confounding information by noise. As the inpainted black boxes leave clearly visible information on relative position, caliper geometry etc., we next replace the missing information with noise. For every connected component of the mask, we find its minimal bounding box as above, and expand the bounding box by 5 pixels in each direction. Next, the contents of the box are replaced by noise as follows: The mean µ and standard deviation σ are computed from the values of those bounding box pixels which are not segmented as belonging to text/markings. The pixels segmented as text/markings are then replaced with noise sampled from a normal distribution N(µ, σ/10). The scaling of the standard deviation was performed to reduce the visual effect of replacing image pixels with noise; the scaling factor was selected from {1, 1/10, 1/100} by optimizing validation set accuracy.
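A minimal sketch of this variant follows; it assumes each expanded bounding box contains some unmasked pixels to estimate µ and σ from, and the function name is illustrative.

```python
import cv2
import numpy as np

def noise_inpaint(image_bgr, overlay_mask, pad=5, sigma_scale=0.1, rng=None):
    """Replace masked pixels with Gaussian noise estimated from the surrounding box."""
    rng = rng or np.random.default_rng()
    out = image_bgr.astype(np.float32)
    H, W = overlay_mask.shape
    n, _, stats, _ = cv2.connectedComponentsWithStats(overlay_mask, connectivity=8)
    for i in range(1, n):
        x0 = max(stats[i, cv2.CC_STAT_LEFT] - pad, 0)
        y0 = max(stats[i, cv2.CC_STAT_TOP] - pad, 0)
        x1 = min(stats[i, cv2.CC_STAT_LEFT] + stats[i, cv2.CC_STAT_WIDTH] + pad, W)
        y1 = min(stats[i, cv2.CC_STAT_TOP] + stats[i, cv2.CC_STAT_HEIGHT] + pad, H)
        box = out[y0:y1, x0:x1]                  # view into the output image
        masked = overlay_mask[y0:y1, x0:x1] > 0
        clean = box[~masked]                     # pixels not segmented as text/markings
        mu, sigma = float(clean.mean()), float(clean.std())
        box[masked] = rng.normal(mu, sigma_scale * sigma,
                                 size=(int(masked.sum()), box.shape[-1]))
    return np.clip(out, 0, 255).astype(np.uint8)
```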
Fig. 2. Illustration of inpainting methods replacing both text and calipers on two
example image patches.
Bilinear interpolation. Third, we apply bilinear interpolation to inpaint the
mask.
Fast marching inpainting. Next, we apply a fast marching inpainting method
available in OpenCV [11] to inpaint the mask.
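This step maps directly onto OpenCV's implementation of Telea's fast marching inpainting [11]; the inpaint radius below is an assumed value, and the wrapper name is illustrative.

```python
import cv2

def fast_marching_inpaint(image_bgr, overlay_mask, radius=3):
    """Fast-marching (Telea) inpainting over the detected text/caliper mask."""
    return cv2.inpaint(image_bgr, overlay_mask, inpaintRadius=radius,
                       flags=cv2.INPAINT_TELEA)
```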
2.2 Deep learning methods for removing confounding effects
Next, we assess two different deep learning approaches that aim to generate confounder-free images. Due to the memory cost, the images are divided into 300 × 240 pixel patches for model training and inference.
U-Net. First, we train a U-net [9] to generate images without confounding information. We regress the text-free image directly by designing the U-net so that its input and output both have three channels. The pixel intensity of the input images is rescaled to [−1, 1], and the network output is activated by the hyperbolic tangent function. The target labels are images inpainted by bilinear interpolation. We use the SGD optimizer with a learning rate of 0.001 and monitor the mean square root loss. The network is trained for 100 epochs with batch size 8.
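A minimal training-loop sketch under the stated settings (SGD, learning rate 0.001, 100 epochs, batch size 8 via the loader). The `unet` module itself (three input and three output channels, as in [9]), the RMSE reading of the loss, and the data loading are assumptions and omissions of this sketch, not the authors' code.

```python
import torch
import torch.nn as nn

def train_inpainting_unet(unet, loader, epochs=100, lr=1e-3, device="cuda"):
    """unet: image-to-image network with 3 input and 3 output channels."""
    unet.to(device)
    optimizer = torch.optim.SGD(unet.parameters(), lr=lr)
    mse = nn.MSELoss()
    for epoch in range(epochs):
        for noisy_patch, target_patch in loader:        # patches rescaled to [-1, 1]
            noisy_patch = noisy_patch.to(device)
            target_patch = target_patch.to(device)      # bilinear-inpainted target
            pred = torch.tanh(unet(noisy_patch))        # tanh keeps outputs in [-1, 1]
            loss = torch.sqrt(mse(pred, target_patch))  # "mean square root" read as RMSE
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return unet
```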
GAN for generating text-free images. As an alternative, we train a GAN [3], which is a state-of-the-art model for scene text removal in natural images. We follow the training settings from the original work. The training set is synthesized by placing the markings detected by thresholding at random positions on images inpainted by bilinear interpolation. This is a common way to construct datasets in scene text removal.
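One way to build such synthetic pairs is sketched below: a marking crop and its mask (obtained from the yellow-thresholding step) are pasted at a random position onto a bilinear-inpainted, text-free image. The function is illustrative, not the authors' pipeline.

```python
import numpy as np

def synthesize_pair(clean_image, marking_patch, marking_mask, rng=None):
    """Return a (corrupted, clean) image pair for scene-text-removal training."""
    rng = rng or np.random.default_rng()
    corrupted = clean_image.copy()
    H, W = clean_image.shape[:2]
    h, w = marking_patch.shape[:2]
    y = int(rng.integers(0, H - h + 1))   # random top-left corner for the marking
    x = int(rng.integers(0, W - w + 1))
    region = corrupted[y:y + h, x:x + w]  # view into the corrupted image
    region[marking_mask > 0] = marking_patch[marking_mask > 0]
    return corrupted, clean_image
```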
2.3 Quantifying the effect of confounders
In order to quantify confounding effects, we train standard plane classification models to classify standard planes for the growth scans typically performed during the 3rd trimester. For such scans, one typically collects the standard planes for head, abdomen, and femur.
We train our models first on raw images, and next on images where the confounding text and calipers are removed using the approaches listed above. Both types of models are tested on an internal clinical dataset consisting of images with confounders, an internal dataset consisting of images without confounders, as well as an external dataset from a different country, where the images do not contain confounders.
3 Experiments and Results
3.1 Standard plane classification models
We train the EfficientNet-B3 architecture [10] for standard plane classification using AdamW with a learning rate of 1e-4, with a weighted cross entropy loss to adjust the training to the imbalanced data. No augmentation is applied. The classifier is trained for at most 50 epochs, using early stopping with a patience of 5 epochs. We used PyTorch 1.10 for all deep learning based models.
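A sketch of this configuration is given below; the torchvision constructor, the inverse-frequency class weights, and the accuracy-based early-stopping criterion are our assumptions about the implementation, not details reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3

def build_classifier(class_counts, num_classes=4, device="cuda"):
    model = efficientnet_b3(num_classes=num_classes).to(device)
    # Inverse-frequency weights to counter the class imbalance (assumed weighting scheme).
    counts = torch.tensor(class_counts, dtype=torch.float32)
    criterion = nn.CrossEntropyLoss(weight=(counts.sum() / counts).to(device))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    return model, criterion, optimizer

def fit(model, criterion, optimizer, train_loader, val_acc_fn,
        max_epochs=50, patience=5, device="cuda"):
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        acc = val_acc_fn(model)          # validation accuracy after each epoch
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                    # early stopping with patience 5
```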
3.2 Data
Internal database. We base our evaluation on 3rd trimester growth screening images from the national fetal ultrasound screening programme from ANONYMOUS COUNTRY. The data was collected and used with permission from the ANONYMISED. These images were annotated by an OB-Gyn resident as being either head, abdomen, femur or other, as these are the relevant standard planes for growth estimation. Note, however, that as these images came from the clinical screening database, from a trimester where screening scans are not all made by expert sonographers, they were not all perfect standard planes. To take this into account, the images were given a quality score from 0 (poor) to 10 (excellent), which is shown in Table 2.
Training data. We performed several experiments with different configurations of the training/validation/test split, all configured with no subject overlap between splits.
We use two different training set configurations for our experiments. All training data is selected from the internal database. The class demographics for both configurations are found in Table 1.
In the first configuration, the training set contains only images with text and calipers. In the second configuration, the training set contains both images with (77%) and without (23%) text and calipers, sampled in such a way that we have a similar number of images per class as in the first configuration.
Plane     Train 1  Val 1  Train 2  Val 2  Local Test          Local Test         External Test
                                          (with confounders)  (no confounders)   (no confounders)
Head          669    147      670    134        127                 121                3092
Abdomen       786    169      774    171        127                 119                 711
Femur         601    118      598    121        127                 113                1040
Other         774    181      746    161         42                  38                4213
N            2830    615     2788    587        423                 391                9056
Table 1. Dataset demographics for the different experiments, detailing training, validation and test splits from the local and external databases.
Test data. We test both on data from our own clinical screening database and on an external dataset from another country [4].
From the internal database we extracted two test sets: One with images
that contain text and calipers, and one with images that do not. The test sets
were designed to be identically distributed across the classes in order to get
comparable performance scores.
Images without text were automatically extracted from the database based
on HSV thresholding of the yellow color corresponding to the text. Since the
scanner’s model name is also yellow and is placed in the black area outside of the
ultrasound field of view, yellow areas that are surrounded by black background
are excluded. This is accomplished by morphological dilation of the given area
to obtain its neighbouring pixels; if the mean of the neighbours is equal to the
background, then it is excluded.
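The exclusion rule can be expressed as a small check on each yellow connected component; the helper below is illustrative and assumes a grayscale copy of the image with a black (zero) background.

```python
import cv2
import numpy as np

def lies_on_black_background(gray_image, component_mask, background_value=0):
    """True if the dilated neighbourhood of the component has background mean intensity."""
    kernel = np.ones((3, 3), np.uint8)
    ring = cv2.dilate(component_mask, kernel) & cv2.bitwise_not(component_mask)
    return float(gray_image[ring > 0].mean()) == background_value
```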
Since the text and calipers are placed by the clinician on those images that are ultimately used for predicting clinical outcomes, there is an expected drop in quality for those images that do not contain text and calipers (see Table 2).
The external database contains ultrasound scans from the 2nd and 3rd trimester, classified into a range of different standard planes, whereas our internal data only contains 3rd trimester scans. As we use the classes "head", "abdomen" and "femur", the remaining images are given the class "other", which is likely distributed differently than the corresponding images from the internal test sets.
Plane     Internal data (with confounders)  Internal data (no confounders)
Head                3.76 ± 1.74                      2.92 ± 2.23
Abdomen             3.16 ± 1.63                      2.46 ± 1.98
Femur               5.13 ± 1.95                      4.46 ± 2.10
Table 2. Quality scores of the standard planes used in the internal data tests. Note the lower quality of the 'no confounders' data.
3.3 Assessing confounding effects
For each experiment, the training and validation sets were resampled 10 times.
Performance was compared to the baseline using t-tests for equality of means of
the accuracies, reporting p-values computed over the 10 repeated runs.
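This comparison corresponds to a standard two-sample t-test over the per-run accuracies, e.g. with SciPy; the helper name is illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_to_baseline(baseline_accuracies, method_accuracies):
    """Two-sample t-test for equality of mean accuracy over the 10 repeated runs."""
    _, p_value = ttest_ind(np.asarray(baseline_accuracies),
                           np.asarray(method_accuracies))
    return p_value
```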
3.4 Experimental results
Results for the different training configurations for the standard plane classification are found in Tables 3 and 4.
Method         Internal data        p-val   Internal data      p-val     External           p-val
               (with confounders)           (no confounders)             (no confounders)
Baseline       97.0% ± 1.1%           -     85.6% ± 4.1%          -      80.5% ± 4.0%         -
BlackBox       96.3% ± 1.1%         0.17    91.7% ± 1.4%       2.8e-04   79.2% ± 2.5%       0.38
Noise          95.8% ± 1.6%         0.070   92.9% ± 1.1%       2.9e-05   80.6% ± 2.2%       0.97
FastMarching   96.7% ± 1.6%         0.67    93.8% ± 1.3%       1.0e-05   80.8% ± 2.6%       0.82
Interpolation  96.1% ± 1.5%         0.16    93.7% ± 1.2%       1.0e-05   80.3% ± 2.4%       0.91
GAN            95.9% ± 1.7%         0.10    92.7% ± 1.0%       4.1e-05   80.4% ± 3.1%       0.96
U-net          96.5% ± 1.3%         0.40    93.8% ± 0.9%       7e-06     79.4% ± 3.0%       0.49
Table 3. Classification results training only on images with confounders.
Note that while there is no significant difference between the methods on
the data with confounders, there is a significant difference to the baseline for all
methods on the data with no confounders.
Method         Internal data        p-val   Internal data      p-val     External           p-val
               (with confounders)           (no confounders)             (no confounders)
Baseline       96.6% ± 1.0%           -     87.4% ± 2.9%          -      84.7% ± 3.2%         -
BlackBox       96.0% ± 1.0%         0.21    92.5% ± 1.9%       2.1e-04   80.9% ± 3.1%       0.013
Noise          95.9% ± 1.5%         0.23    93.5% ± 2.3%       6.9e-05   82.9% ± 3.1%       0.21
FastMarching   96.3% ± 1.3%         0.59    94.5% ± 1.8%       4.0e-06   83.1% ± 2.6%       0.22
Interpolation  96.6% ± 0.6%         1.0     94.6% ± 1.0%       1.0e-06   82.8% ± 1.8%       0.11
GAN            96.1% ± 1.4%         0.31    93.6% ± 1.5%       1.3e-05   83.0% ± 1.5%       0.13
U-net          96.9% ± 1.6%         0.70    94.7% ± 2.1%       5e-06     81.3% ± 2.1%       0.011
Table 4. Classification results training both on images with and without confounders.
4 Discussion and conclusion
We have shown that deep learning algorithms can be confounded when trained
on clinical ultrasound images with embedded text or calipers. We have compared
several methods for removing text and calipers, ranging from simple detection, removal and classical inpainting or interpolation, to state-of-the-art deep learning models developed to remove text from natural images. All methods have a positive effect by bringing classification performance on clean test images closer to the performance on test images with embedded confounders, even though several of them leave visible artefacts that carry spatial information about the removed confounders. Moreover, the simple methods perform on par with or even better than the deep learning models. One reason might be that while the deep learning models re-predict the entire image, the simple methods only replace those parts of the image corrupted by text and calipers. Another drawback is that the neural networks learn texture features for the whole image simultaneously, and inpainting text and calipers with such generic textures may be less beneficial than inpainting with locally inferred texture.
In terms of computational cost, at inference time the deep learning models
are about twice as fast as the classical methods. However, the deep learning
models also require training time. Moreover, as the classical methods run on
CPU, they could likely compete with the inference speed of the deep learning
models if they were also implemented on GPU.
We note that the performance on clean images from the internal dataset is still slightly below the performance on images with embedded confounders. This could be due to the generally lower quality of those images that do not have embedded text and calipers. The images without text and calipers are those that were not chosen as standard plane representatives by the clinician; the highest quality images are the ones used for the clinical calculations and measurements.
Why is it so important to be able to train deep learning algorithms on clinical quality images? Why don't we, instead, perform our own data acquisition, obtaining images of the desired quality but without embedded calipers and text? For standard plane classification, this might be feasible, but it would still leave us with far less training data than national screening databases can provide. Even more importantly, however, national screening databases also come with the potential for linking with registries cataloguing patient outcomes. Such registries would allow us to train models to recognize rare anomalies and diseases, which we would have no guarantee of finding represented in a smaller dataset acquired for the task. In order to train such models, we need to be able to train our networks robustly, without being affected by confounding information.
Why do we try to fix the data, when instead we could try to fix the algorithm? Indeed, it is desirable to develop algorithms that are fundamentally robust to confounding information. Existing approaches to this problem rely heavily on application-specific prior knowledge, such as being able to segment the confounding information away from the image [1]. As, in our case, the confounding information sits right on top of the most relevant part of the image, these approaches do not carry over. By understanding how we might improve training by debiasing our data, we believe we will be better equipped, down the line, to develop algorithms that are inherently robust to confounders and bias.
Acknowledgements The work was funded in part by the Innovation Fund Denmark for the project DIREC (9142-00001B); The Capital Region Research Fund and The AI Signature Project, Danish Regions; and the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). The authors acknowledge the Pioneer Centre for AI, DNRF grant P1.
References
1. Barnett, A.J., Schwartz, F.R., Tao, C., Chen, C., Ren, Y., Lo, J.Y., Rudin, C.: A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nature Machine Intelligence 3(12), 1061–1070 (2021)
2. Bevan, P.J., Atapour-Abarghouei, A.: Skin deep unlearning: Artefact and instrument debiasing in the context of melanoma classification. arXiv preprint arXiv:2109.09818 (2021)
3. Bian, X., Wang, C., Quan, W., Ye, J., Zhang, X., Yan, D.M.: Scene text removal via cascaded text stroke detection and erasing. Computational Visual Media 8(2), 273–287 (2022)
4. Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-Carne, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Scientific Reports 10(1), 1–12 (2020)
5. Maron, R.C., Hekler, A., Krieghoff-Henning, E., Schmitt, M., Schlager, J.G., Utikal, J.S., Brinker, T.J.: Reducing the impact of confounding factors on skin cancer classification via image segmentation: technical model study. Journal of Medical Internet Research 23(3), e21695 (2021)
6. Narla, A., Kuprel, B., Sarin, K., Novoa, R., Ko, J.: Automated classification of skin lesions: from pixels to practice. Journal of Investigative Dermatology 138(10), 2108–2110 (2018)
7. Nauta, M., Walsh, R., Dubowski, A., Seifert, C.: Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis. Diagnostics 12(1), 40 (2022)
8. Rieger, L., Singh, C., Murdoch, W., Yu, B.: Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. In: International Conference on Machine Learning. pp. 8116–8126. PMLR (2020)
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
10. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. pp. 6105–6114. PMLR (2019)
11. Telea, A.: An image inpainting technique based on the fast marching method. Journal of Graphics Tools 9(1) (2004). https://doi.org/10.1080/10867651.2004.10487596
12. Winkler, J.K., Fink, C., Toberer, F., Enk, A., Deinlein, T., Hofmann-Wellenhof, R., Thomas, L., Lallas, A., Blum, A., Stolz, W., et al.: Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology 155(10), 1135–1141 (2019)