
Removing confounding information from fetal ultrasound images

March 2023


License
CC BY 4.0

Authors:

Morten Bo Søndergaard Svendsen

Region Hovedstaden, København, Denmark



Removing confounding information from fetal ultrasound images

Kamil Mikolaj¹,², Manxi Lin¹, Zahra Bashir², Morten Bo Søndergaard Svendsen², Martin Tolsgaard², Anders Nymark¹, and Aasa Feragen¹

¹ DTU Compute, Technical University of Denmark

² CAMES Rigshospitalet

{kmik, afhar}@dtu.dk

Abstract. Confounding information in the form of text or markings embedded in medical images can severely affect the training of diagnostic deep learning algorithms. However, data collected for clinical purposes often have such markings embedded in them. In dermatology, known examples include drawings or rulers that are overrepresented in images of malignant lesions. In this paper, we encounter text and calipers placed on the images found in national databases containing fetal screening ultrasound scans, which correlate with standard planes to be predicted. In order to utilize the vast amounts of data available in these databases, we develop and validate a series of methods for minimizing the confounding effects of embedded text and calipers on deep learning algorithms designed for ultrasound, using standard plane classification as a test case.

Keywords: Removing confounding information · Model Bias · Fetal Ultrasound · Standard Plane Classification.

1 Introduction

Clinical data is a great potential source of data for training medical imaging models: it can come in vast amounts, and it represents the data quality encountered in clinical practice. However, when data comes from the clinic, there is little control of the data generating process, and the data may be affected in ways that are suboptimal for training deep learning models. In particular, clinical images sometimes come with embedded text, markings, calipers or other annotations made by the clinician, and these likely carry information correlating with the predictive task at hand. It has recently been shown that markings, stickers and rulers present in dermatological images can confound predictors that aim to diagnose skin lesions [6,12,7]. In this paper, we consider confounding information present in fetal ultrasound images from clinical screening. As shown in Fig. 1, these images often have text and calipers embedded in them, which can affect predictors trained on the images. As a case study, we use standard plane classification, which aims to automatically recognize those ultrasound planes required for particular types of screening tests during pregnancy.


arXiv:2303.13918v1 [cs.CV] 24 Mar 2023


Fig. 1. Examples of clinical ultrasound images with text and calipers, i.e. annotated coordinates for measuring anatomical objects, embedded into the image. Note that both the text labels and the caliper geometry carry information about the particular standard plane that the image contains.

Our contribution. First, we quantify the confounding effect of text and calipers embedded in fetal ultrasound images used for training neural networks. Next, we quantitatively assess the success of different methods that aim to remove these confounding effects, ranging from naïve methods to state-of-the-art text inpainting methods developed for natural images. We show that even simple methods that mask out the confounding information ensure improved generalization to images that do not contain confounding information.

1.1 Related work

Recent work in dermatological imaging pointed out that pen markings or rulers, which are often present in images of malignant skin lesions from clinical practice, were confounding skin lesion diagnosis performed by a CNN approved as a medical device [12,6]. More precisely, it was shown that the predictive models performed better on images with rulers and markings than on those without. This discovery spurred research into methods for removing this confounding effect, such as segmenting the lesion as a preprocessing step [5] to avoid looking at context; inpainting stickers present on benign images to remove them from the training images [7]; inserting prior knowledge into the models [1,8]; or adversarially training the neural network to be unable to recognize whether the confounding information is present [2].

In our setting, the confounding information is typically embedded into the clinically relevant part of the image. As a result, we cannot apply methods that remove context by segmenting out the object of interest. Instead, we focus on validating both simple and more complex models for removing and inpainting confounding text and calipers.


2 Methodology

We assess the confounding effects of embedded text and calipers using standard plane classification for 3rd trimester growth scans as a test case. In order to assess the weight of the fetus, as is commonly done in the 3rd trimester, the clinician needs to obtain standard planes for the head, abdomen and femur. As a typical application is to distinguish good standard planes from nonstandard planes that cannot be used for standardized measurements, we also include images that are not standard planes, which should ideally be classified as "Other".

We assess six different methods for removing confounding information. Examples of images inpainted with the different methods are found in Fig. 2.

2.1 Simple methods for removing confounding effects

The initial four methods consist of first detecting the text and calipers and then replacing them in various ways.

The text and calipers embedded in the clinically relevant part of the image are (in our training set) always yellow; these are detected via thresholding in hue, saturation, value (HSV) space to segment yellow features. The resulting mask is dilated with a 3x3 structuring element to enlarge the segmentation and connect neighboring elements.
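The detection step can be sketched as follows. This is a minimal illustration using only the Python standard library; the hue, saturation and value thresholds are assumptions chosen for the example, since the text does not state the values actually used, and the function names `yellow_mask` and `dilate3x3` are hypothetical.

```python
import colorsys


def yellow_mask(rgb_image, hue_range=(40 / 360, 70 / 360), min_sat=0.5, min_val=0.5):
    """Mark yellow pixels by thresholding in HSV space.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    The threshold values are illustrative assumptions, not the paper's settings.
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(hue_range[0] <= h <= hue_range[1] and s >= min_sat and v >= min_val)
        mask.append(mask_row)
    return mask


def dilate3x3(mask):
    """Binary dilation with a full 3x3 structuring element."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Switch on the pixel itself and its eight neighbours.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        out[ny][nx] = True
    return out
```

In practice the same two steps would typically be run with an image library on whole arrays; the nested-list version is only meant to make the logic explicit.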

Additionally, as can be seen in Fig. 1, most images contain some gray and blue text in the top right corner. To remove this, we first remove everything around the conical ultrasound field of view. A mask is obtained by thresholding the cone in HSV, finding the largest connected component and filling the holes, after which everything else can be masked out. For cases where some blue and gray text is on top of the field of view, the blue text is detected, and everything above it is replaced by a black box.

We next consider various approaches to inpainting the yellow masks.

Black box. In this first simple approach, the detected yellow mask is overlaid by black boxes spanned by the minimum and maximum x- and y-coordinates found within every connected component of the mask.
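A minimal sketch of the black box step is given below, using only the Python standard library. It assumes a grayscale image stored as nested lists and 8-connectivity for the connected components; neither choice is specified in the text, and the function names are hypothetical.

```python
from collections import deque


def component_boxes(mask):
    """Return inclusive (y0, x0, y1, x1) bounding boxes of the 8-connected components."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search over one component, tracking its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            y0 = y1 = y
            x0 = x1 = x
            while queue:
                cy, cx = queue.popleft()
                y0, y1 = min(y0, cy), max(y1, cy)
                x0, x1 = min(x0, cx), max(x1, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((y0, x0, y1, x1))
    return boxes


def black_box(image, mask):
    """Return a copy of the image with every component's bounding box set to 0."""
    out = [row[:] for row in image]
    for y0, x0, y1, x1 in component_boxes(mask):
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                out[y][x] = 0
    return out
```

Note that the whole bounding box is blacked out, including pixels inside the box that were not part of the mask.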

Replacing confounding information by noise. As the inpainted black boxes leave clearly visible information on relative position, caliper geometry, etc., we next replace the missing information with noise. For every connected component of the mask, we find its minimal bounding box as above, and expand the bounding box by 5 pixels in each direction. Next, the masked contents of the box are replaced by noise as follows: the mean µ and standard deviation σ are computed from the values of those bounding box pixels which are not segmented as belonging to text/markings. Those pixels segmented as text/markings are then replaced with noise sampled from a normal distribution N(µ, σ/10). The standard deviation was scaled to reduce the visual effect of replacing image pixels with noise; the scaling factor was selected from {1, 1/10, 1/100} by optimizing validation set accuracy.
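The noise replacement for a single connected component can be sketched as follows. This is an illustration under stated assumptions: the image is grayscale and stored as nested lists, the bounding box is passed in precomputed, and the population standard deviation is used for σ; the function name `fill_with_noise` is hypothetical.

```python
import random
import statistics


def fill_with_noise(image, mask, box, margin=5, sigma_scale=0.1, rng=None):
    """Replace masked pixels inside an expanded bounding box with Gaussian noise.

    image: grayscale image as a list of rows of numbers.
    mask:  same shape, True where text or calipers were detected.
    box:   inclusive (y0, x0, y1, x1) bounding box of one connected component.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    y0, x0, y1, x1 = box
    # Expand the box by `margin` pixels in each direction, clipped to the image.
    y0, x0 = max(0, y0 - margin), max(0, x0 - margin)
    y1, x1 = min(height - 1, y1 + margin), min(width - 1, x1 + margin)

    inside = [(y, x) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    clean = [image[y][x] for y, x in inside if not mask[y][x]]
    out = [row[:] for row in image]
    if not clean:
        return out  # no unmasked pixels to estimate statistics from

    # Statistics come only from pixels that are NOT text or markings.
    mu = statistics.fmean(clean)
    sigma = statistics.pstdev(clean)
    for y, x in inside:
        if mask[y][x]:
            out[y][x] = rng.gauss(mu, sigma * sigma_scale)
    return out
```

With `sigma_scale=0.1` this corresponds to sampling from N(µ, σ/10); the other candidate factors, 1 and 1/100, are obtained by changing that argument.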


Fig. 2. Illustration of inpainting methods replacing both text and calipers on two example image patches.

Bilinear interpolation. Third, we apply bilinear interpolation to inpaint the mask.

Fast marching inpainting. Next, we apply a fast marching inpainting method available in OpenCV [11] to inpaint the mask.

2.2 Deep learning methods for removing confounding effects

Next, we assess two different deep learning approaches that aim to generate confounder-free images. Due to the memory cost, the images are divided into patches of 300 × 240 pixels for model training and inference.

U-Net. First, we train a U-net [9] to generate images without confounding information. We regress the text-free image directly by designing the U-net so that its input and output both have three channels. The pixel intensity of the input images is rescaled to [−1, 1], and the network output is activated by the hyperbolic tangent function. The target labels are images inpainted by bilinear interpolation. We use the SGD optimizer with a learning rate of 0.001 and monitor the mean square root loss. The network is trained for 100 epochs with batch size 8.

GAN for generating text-free images. As an alternative, we train a GAN [3], which is a state-of-the-art model for scene text removal in natural images. We follow the training settings from the original work. The training set is synthesized by placing the markings detected by thresholding randomly on the images inpainted by bilinear interpolation. This is a common way to construct datasets in scene text removal.
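The synthesis of one training pair can be sketched as below. This is a minimal illustration, assuming grayscale nested-list images and a uniformly random paste location; the helper name `synthesize_pair` and its arguments are hypothetical and not taken from the original work.

```python
import random


def synthesize_pair(clean, marking, marking_mask, rng=None):
    """Paste a detected marking at a random location on a clean image.

    clean:        inpainted (marking-free) image, list of rows.
    marking:      small patch containing the marking pixels.
    marking_mask: same shape as `marking`, True on the marking itself.
    Returns (corrupted, clean): the network input and its regression target.
    """
    rng = rng or random.Random()
    height, width = len(clean), len(clean[0])
    patch_h, patch_w = len(marking), len(marking[0])
    # randint is inclusive at both ends, so the patch always fits.
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)

    corrupted = [row[:] for row in clean]
    for y in range(patch_h):
        for x in range(patch_w):
            if marking_mask[y][x]:
                corrupted[top + y][left + x] = marking[y][x]
    return corrupted, clean
```

Only the pixels under the marking mask are copied, so the ultrasound texture around the pasted text or caliper stays untouched.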


2.3 Quantifying the effect of confounders

In order to quantify confounding effects, we train standard plane classification models to classify standard planes for the growth scans typically performed during the 3rd trimester. For such scans, one typically collects the standard planes for head, abdomen, and femur.

We train our models first on raw images, and next on images where the confounding text and calipers are removed using the approaches listed above. Both types of models are tested on three datasets: an internal clinical dataset consisting of images with confounders, an internal dataset consisting of images without confounders, and an external dataset from a different country, where the images do not contain confounders.

3 Experiments and Results

3.1 Standard plane classification models

We train the EfficientNet B3 architecture [10] for standard plane classification using AdamW with a learning rate of 1e−4, with weighted cross entropy loss to adjust the training to imbalanced data. No augmentation is applied. The classifier is trained for at most 50 epochs, using early stopping with a patience of 5 epochs. We use PyTorch 1.10 for all deep learning based models.
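To illustrate how a weighted cross entropy loss counteracts class imbalance, the sketch below computes class weights and the resulting loss in plain Python. The inverse-frequency weighting is an assumption made for the example, since the weighting scheme is not specified in the text; the class counts are the "Train 1" column of Table 1, and the function names are hypothetical.

```python
import math


def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c), an assumed (not source-confirmed) scheme."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the sum of target weights.

    probs:   list of dicts mapping class name to predicted probability.
    targets: list of true class names, one per prediction.
    """
    numerator = sum(-weights[t] * math.log(p[t]) for p, t in zip(probs, targets))
    denominator = sum(weights[t] for t in targets)
    return numerator / denominator


# Class counts from the "Train 1" column of Table 1.
train1_counts = {"Head": 669, "Abdomen": 786, "Femur": 601, "Other": 774}
class_weights = inverse_frequency_weights(train1_counts)
```

Under this scheme the least frequent class (femur) receives the largest weight, so errors on femur images contribute more to the loss than errors on the more frequent classes.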

3.2 Data

Internal database. We base our evaluation on 3rd trimester growth screening images from the national fetal ultrasound screening programme from ANONYMOUS COUNTRY. The data was collected and used with permission from the ANONYMISED. These images were annotated by an OB-Gyn resident as being either head, abdomen, femur or other, as these are the relevant standard planes for growth estimation. Note, however, that as these images came from the clinical screening database, from a trimester where screening scans are not all made by expert sonographers, they were not all perfect standard planes. To take this into account, the images were given a quality score from 0 (poor) to 10 (excellent), which is shown in Table 2.

Training data. We performed several experiments with different configurations of the training/validation/test split, all configured with no subject overlap between splits.

We use two different training set configurations for our experiments. All training data is selected from the internal database. The class demographics for both configurations are found in Table 1.

In the first configuration, the training set contains only images with text and calipers. In the second configuration, the training set contains both images with (77%) and without (23%) text and calipers, sampled in such a way that we have a similar number of images per class as in the first configuration.


| Plane | Train 1 | Val 1 | Train 2 | Val 2 | Local Test (with confounders) | Local Test (no confounders) | External Test (no confounders) |
|---|---|---|---|---|---|---|---|
| Head | 669 | 147 | 670 | 134 | 127 | 121 | 3092 |
| Abdomen | 786 | 169 | 774 | 171 | 127 | 119 | 711 |
| Femur | 601 | 118 | 598 | 121 | 127 | 113 | 1040 |
| Other | 774 | 181 | 746 | 161 | 42 | 38 | 4213 |
| N | 2830 | 615 | 2788 | 587 | 423 | 391 | 9056 |

Table 1. Dataset demographics for the different experiments, detailing training, validation and test splits from the local and external databases.

Test data. We test both on data from our own clinical screening database, and on an external dataset from another country [4].

From the internal database we extracted two test sets: one with images that contain text and calipers, and one with images that do not. The test sets were designed to be identically distributed across the classes in order to get comparable performance scores.

Images without text were automatically extracted from the database based on HSV thresholding of the yellow color corresponding to the text. Since the scanner’s model name is also yellow and is placed in the black area outside of the ultrasound field of view, yellow areas that are surrounded by black background are excluded. This is accomplished by morphological dilation of the given area to obtain its neighbouring pixels; if the mean of the neighbours is equal to the background, then it is excluded.
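The exclusion rule for yellow areas outside the field of view can be sketched as follows. This is a minimal illustration, assuming a grayscale image as nested lists, a single 3x3 dilation to obtain the neighbours, and a background value of 0; the size of the dilation is not given in the text, and the function names are hypothetical.

```python
def ring_of_neighbours(region, height, width):
    """Pixels added by one 3x3 dilation of `region`, excluding the region itself.

    region is a set of (y, x) coordinates belonging to one yellow area.
    """
    ring = set()
    for y, x in region:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (ny, nx) not in region and 0 <= ny < height and 0 <= nx < width:
                    ring.add((ny, nx))
    return ring


def is_surrounded_by_background(gray, region, background=0):
    """True if the mean intensity of the region's neighbours equals the background."""
    ring = ring_of_neighbours(region, len(gray), len(gray[0]))
    if not ring:
        return False
    mean = sum(gray[y][x] for y, x in ring) / len(ring)
    return mean == background
```

A yellow area for which `is_surrounded_by_background` returns True, such as the scanner's model name, would be ignored when deciding whether an image contains text.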

Since the text and calipers are placed by the clinician on those images that are in the end used for predicting clinical outcomes, there is an expected drop in quality for those images that do not contain text and calipers (see Table 2).

The external database contains ultrasound scans from 2nd and 3rd trimester, classified into a range of different standard planes, whereas our internal data only contains 3rd trimester. As we use the classes "head", "abdomen" and "femur", the remaining images are given the class "other", which is likely differently distributed than the corresponding images from the internal test sets.

| Plane | Internal data (with confounders) | Internal data (no confounders) |
|---|---|---|
| Head | 3.76 ± 1.74 | 2.92 ± 2.23 |
| Abdomen | 3.16 ± 1.63 | 2.46 ± 1.98 |
| Femur | 5.13 ± 1.95 | 4.46 ± 2.10 |

Table 2. The quality score of given standard planes used in internal data tests. Note the lower quality of the 'no confounders' data.


3.3 Assessing confounding effects

For each experiment, the training and validation sets were resampled 10 times. Performance was compared to the baseline using t-tests for equality of means of the accuracies, reporting p-values computed over the 10 repeated runs.
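The comparison can be sketched with a two-sample t-test implemented in plain Python. This is a sketch under assumptions: it uses Student's pooled-variance test (the text does not say whether the pooled or Welch variant was used), and it obtains the p-value by numerically integrating the t density rather than calling a statistics library. The function names are hypothetical.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central)


def pooled_t_test(a, b):
    """Student's two-sample t-test; returns (t statistic, two-sided p-value)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)
```

Here `a` and `b` would be the accuracies from the 10 repeated runs of the baseline and of one removal method, giving 18 degrees of freedom for the pooled test.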

3.4 Experimental results

Results for the different training configurations for the standard plane classification are found in Tables 3 and 4.

| Method | Internal data (with confounders) | p-value | Internal data (no confounders) | p-value | External (no confounders) | p-value |
|---|---|---|---|---|---|---|
| Baseline | 97.0% ± 1.1% | - | 85.6% ± 4.1% | - | 80.5% ± 4.0% | - |
| BlackBox | 96.3% ± 1.1% | 0.17 | 91.7% ± 1.4% | 2.8e-04 | 79.2% ± 2.5% | 0.38 |
| Noise | 95.8% ± 1.6% | 0.070 | 92.9% ± 1.1% | 2.9e-05 | 80.6% ± 2.2% | 0.97 |
| FastMarching | 96.7% ± 1.6% | 0.67 | 93.8% ± 1.3% | 1.0e-05 | 80.8% ± 2.6% | 0.82 |
| Interpolation | 96.1% ± 1.5% | 0.16 | 93.7% ± 1.2% | 1.0e-05 | 80.3% ± 2.4% | 0.91 |
| GAN | 95.9% ± 1.7% | 0.10 | 92.7% ± 1.0% | 4.1e-05 | 80.4% ± 3.1% | 0.96 |
| U-net | 96.5% ± 1.3% | 0.40 | 93.8% ± 0.9% | 7e-06 | 79.4% ± 3.0% | 0.49 |

Table 3. Classification results when training only on images with confounders.

Note that while there is no significant difference between the methods on the data with confounders, there is a significant difference to the baseline for all methods on the data with no confounders.

| Method | Internal data (with confounders) | p-value | Internal data (no confounders) | p-value | External (no confounders) | p-value |
|---|---|---|---|---|---|---|
| Baseline | 96.6% ± 1.0% | - | 87.4% ± 2.9% | - | 84.7% ± 3.2% | - |
| BlackBox | 96.0% ± 1.0% | 0.21 | 92.5% ± 1.9% | 2.1e-04 | 80.9% ± 3.1% | 0.013 |
| Noise | 95.9% ± 1.5% | 0.23 | 93.5% ± 2.3% | 6.9e-05 | 82.9% ± 3.1% | 0.21 |
| FastMarching | 96.3% ± 1.3% | 0.59 | 94.5% ± 1.8% | 4.0e-06 | 83.1% ± 2.6% | 0.22 |
| Interpolation | 96.6% ± 0.6% | 1.0 | 94.6% ± 1.0% | 1.0e-06 | 82.8% ± 1.8% | 0.11 |
| GAN | 96.1% ± 1.4% | 0.31 | 93.6% ± 1.5% | 1.3e-05 | 83.0% ± 1.5% | 0.13 |
| U-net | 96.9% ± 1.6% | 0.70 | 94.7% ± 2.1% | 5e-06 | 81.3% ± 2.1% | 0.011 |

Table 4. Classification results when training both on images with and without confounders.

4 Discussion and conclusion

We have shown that deep learning algorithms can be confounded when trained on clinical ultrasound images with embedded text or calipers. We have compared several methods for removing text and calipers, ranging from simple detection, removal and classical inpainting or interpolation, to state-of-the-art deep learning models developed to remove text from natural images. All methods have a positive effect by bringing classification performance on clean test images closer to the performance on test images with embedded confounders, even though several of them leave visible artefacts that carry spatial information about the removed confounders. Moreover, the simple methods perform on par with or even better than the deep learning models. One reason might be that while the deep learning models re-predict the entire image, the simple methods only replace those parts of the image corrupted by text and calipers. Another drawback is that the neural networks learn texture features for the whole image simultaneously, and inpainting text and calipers with such generic textures may be less beneficial than inpainting with locally inferred texture.

In terms of computational cost, at inference time the deep learning models are about twice as fast as the classical methods. However, the deep learning models also require training time. Moreover, as the classical methods run on CPU, they could likely compete with the inference speed of the deep learning models if they were also implemented on GPU.

We note that the performance on clean images from the internal dataset is still slightly below the performance on images with embedded confounders. This could be due to the generally lower quality of those images that do not have embedded text and calipers. The images without text and calipers are those that were not chosen as standard plane representatives by the clinician: the highest quality images are the ones used for the clinical calculations and measurements.

Why is it so important to be able to train deep learning algorithms on clinical quality images? Why don’t we, instead, perform our own data acquisition obtaining images of the sought quality, but without embedded calipers and text? For standard plane classification, this might be feasible, but would still leave us with far less training data than national screening databases can provide. Even more importantly, however, national screening databases also come with potential for linking with registries cataloguing patient outcomes. Such registries would allow us to train models to recognize rare anomalies and diseases, which we would have no guarantee of finding represented in a smaller dataset acquired for the task. In order to train such models, we need to be able to train our networks robustly, without being affected by confounding information.

Why do we try to fix the data, when instead we could try to fix the algorithm? Indeed, it is desirable to develop algorithms that are fundamentally robust to confounding information. Existing approaches to this problem rely heavily on application specific prior knowledge, such as being able to segment the confounding information away from the image [1]. As, in our case, the confounding information sits right on top of the most relevant part of the image, these approaches do not carry over. By understanding how we might improve training by debiasing our data, we believe we will be better equipped, down the line, to develop algorithms that are inherently robust to confounders and bias.


Acknowledgements. The work was funded in part by the Innovation Fund Denmark for the project DIREC (9142-00001B); The Capital Region Research Fund and The AI Signature Project, Danish Regions; and the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606). The authors acknowledge the Pioneer Centre for AI, DNRF grant P1.

References

1. Barnett, A.J., Schwartz, F.R., Tao, C., Chen, C., Ren, Y., Lo, J.Y., Rudin, C.: A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nature Machine Intelligence 3(12), 1061–1070 (2021)

2. Bevan, P.J., Atapour-Abarghouei, A.: Skin deep unlearning: Artefact and instrument debiasing in the context of melanoma classification. arXiv preprint arXiv:2109.09818 (2021)

3. Bian, X., Wang, C., Quan, W., Ye, J., Zhang, X., Yan, D.M.: Scene text removal via cascaded text stroke detection and erasing. Computational Visual Media 8(2), 273–287 (2022)

4. Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-Carne, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Scientific Reports 10(1), 1–12 (2020)

5. Maron, R.C., Hekler, A., Krieghoff-Henning, E., Schmitt, M., Schlager, J.G., Utikal, J.S., Brinker, T.J.: Reducing the impact of confounding factors on skin cancer classification via image segmentation: technical model study. Journal of Medical Internet Research 23(3), e21695 (2021)

6. Narla, A., Kuprel, B., Sarin, K., Novoa, R., Ko, J.: Automated classification of skin lesions: from pixels to practice. Journal of Investigative Dermatology 138(10), 2108–2110 (2018)

7. Nauta, M., Walsh, R., Dubowski, A., Seifert, C.: Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis. Diagnostics 12(1), 40 (2022)

8. Rieger, L., Singh, C., Murdoch, W., Yu, B.: Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. In: International conference on machine learning. pp. 8116–8126. PMLR (2020)

9. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)

10. Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

11. Telea, A.: An image inpainting technique based on the fast marching method. Journal of Graphics Tools 9(1) (2004). https://doi.org/10.1080/10867651.2004.10487596

12. Winkler, J.K., Fink, C., Toberer, F., Enk, A., Deinlein, T., Hofmann-Wellenhof, R., Thomas, L., Lallas, A., Blum, A., Stolz, W., et al.: Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA dermatology 155(10), 1135–1141 (2019)

ResearchGate has not been able to resolve any citations for this publication.

Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis

Article

Full-text available

Dec 2021

Machine learning models have been successfully applied for analysis of skin images. However, due to the black box nature of such deep learning models, it is difficult to understand their underlying reasoning. This prevents a human from validating whether the model is right for the right reasons. Spurious correlations and other biases in data can cause a model to base its predictions on such artefacts rather than on the true relevant information. These learned shortcuts can in turn cause incorrect performance estimates and can result in unexpected outcomes when the model is applied in clinical practice. This study presents a method to detect and quantify this shortcut learning in trained classifiers for skin cancer diagnosis, since it is known that dermoscopy images can contain artefacts. Specifically, we train a standard VGG16-based skin cancer classifier on the public ISIC dataset, for which colour calibration charts (elliptical, coloured patches) occur only in benign images and not in malignant ones. Our methodology artificially inserts those patches and uses inpainting to automatically remove patches from images to assess the changes in predictions. We find that our standard classifier partly bases its predictions of benign images on the presence of such a coloured patch. More importantly, by artificially inserting coloured patches into malignant images, we show that shortcut learning results in a significant increase in misdiagnoses, making the classifier unreliable when used in clinical practice. With our results, we, therefore, want to increase awareness of the risks of using black box machine learning models trained on potentially biased datasets. Finally, we present a model-agnostic method to neutralise shortcut learning by removing the bias in the training dataset by exchanging coloured patches with benign skin tissue using image inpainting and re-training the classifier on this de-biased dataset.

A case-based interpretable deep learning model for classification of mass lesions in digital mammography

Article

Full-text available

Dec 2021

Interpretability in machine learning models is important in high-stakes decisions such as whether to order a biopsy based on a mammographic exam. Mammography poses important challenges that are not present in other computer vision tasks: datasets are small, confounding information is present and it can be difficult even for a radiologist to decide between watchful waiting and biopsy based on a mammogram alone. In this work we present a framework for interpretable machine learning-based mammography. In addition to predicting whether a lesion is malignant or benign, our work aims to follow the reasoning processes of radiologists in detecting clinically relevant semantic features of each image, such as the characteristics of the mass margins. The framework includes a novel interpretable neural network algorithm that uses case-based reasoning for mammography. Our algorithm can incorporate a combination of data with whole image labelling and data with pixel-wise annotations, leading to better accuracy and interpretability even with a small number of images. Our interpretable models are able to highlight the classification-relevant parts of the image, whereas other methods highlight healthy tissue and confounding information. Our models are decision aids—rather than decision makers—and aim for better overall human–machine collaboration. We do not observe a loss in mass margin classification accuracy over a black box neural network trained on the same data.

Scene text removal via cascaded text stroke detection and erasing

Article

Full-text available

Jun 2022

Recent learning-based approaches show promising performance improvement for the scene text removal task but usually leave several remnants of text and provide visually unpleasant results. In this work, a novel end-to-end framework is proposed based on accurate text stroke detection. Specifically, the text removal problem is decoupled into text stroke detection and stroke removal; we design separate networks to solve these two subproblems, the latter being a generative network. These two networks are combined as a processing unit, which is cascaded to obtain our final model for text removal. Experimental results demonstrate that the proposed method substantially outperforms the state-of-the-art for locating and erasing scene text. A new large-scale real-world dataset with 12,120 images has been constructed and is being made available to facilitate research, as current publicly available datasets are mainly synthetic so cannot properly measure the performance of different methods.

Reducing the Impact of Confounding Factors on Skin Cancer Classification via Image Segmentation: Technical Model Study

Article

Full-text available

Mar 2021
J MED INTERNET RES

Background Studies have shown that artificial intelligence achieves similar or better performance than dermatologists in specific dermoscopic image classification tasks. However, artificial intelligence is susceptible to the influence of confounding factors within images (eg, skin markings), which can lead to false diagnoses of cancerous skin lesions. Image segmentation can remove lesion-adjacent confounding factors but greatly change the image representation. Objective The aim of this study was to compare the performance of 2 image classification workflows where images were either segmented or left unprocessed before the subsequent training and evaluation of a binary skin lesion classifier. Methods Separate binary skin lesion classifiers (nevus vs melanoma) were trained and evaluated on segmented and unsegmented dermoscopic images. For a more informative result, separate classifiers were trained on 2 distinct training data sets (human against machine [HAM] and International Skin Imaging Collaboration [ISIC]). Each training run was repeated 5 times. The mean performance of the 5 runs was evaluated on a multi-source test set (n=688) consisting of a holdout and an external component. ResultsOur findings showed that when trained on HAM, the segmented classifiers showed a higher overall balanced accuracy (75.6% [SD 1.1%]) than the unsegmented classifiers (66.7% [SD 3.2%]), which was significant in 4 out of 5 runs (P

Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes

Article

Full-text available

Jun 2020

The goal of this study was to evaluate the maturity of current Deep Learning classification techniques for their application in a real maternal-fetal clinical environment. A large dataset of routinely acquired maternal-fetal screening ultrasound images (which will be made publicly available) was collected from two different hospitals by several operators and ultrasound machines. All images were manually labeled by an expert maternal fetal clinician. Images were divided into 6 classes: four of the most widely used fetal anatomical planes (Abdomen, Brain, Femur and Thorax), the mother’s cervix (widely used for prematurity screening) and a general category to include any other less common image plane. Fetal brain images were further categorized into the 3 most common fetal brain planes (Trans-thalamic, Trans-cerebellum, Trans-ventricular) to judge fine grain categorization performance. The final dataset is comprised of over 12,400 images from 1,792 patients, making it the largest ultrasound dataset to date. We then evaluated a wide variety of state-of-the-art deep Convolutional Neural Networks on this dataset and analyzed results in depth, comparing the computational models to research technicians, which are the ones currently performing the task daily. Results indicate for the first time that computational models have similar performance compared to humans when classifying common planes in human fetal examination. However, the dataset leaves the door open on future research to further improve results, especially on fine-grained plane categorization.

Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition

Article

Aug 2019

Importance Deep learning convolutional neural networks (CNNs) have shown a performance at the level of dermatologists in the diagnosis of melanoma. Accordingly, further exploring the potential limitations of CNN technology before broadly applying it is of special interest. Objective To investigate the association between gentian violet surgical skin markings in dermoscopic images and the diagnostic performance of a CNN approved for use as a medical device in the European market. Design and Setting A cross-sectional analysis was conducted from August 1, 2018, to November 30, 2018, using a CNN architecture trained with more than 120 000 dermoscopic images of skin neoplasms and corresponding diagnoses. The association of gentian violet skin markings in dermoscopic images with the performance of the CNN was investigated in 3 image sets of 130 melanocytic lesions each (107 benign nevi, 23 melanomas). Exposures The same lesions were sequentially imaged with and without the application of a gentian violet surgical skin marker and then evaluated by the CNN for their probability of being a melanoma. In addition, the markings were removed by manually cropping the dermoscopic images to focus on the melanocytic lesion. Main Outcomes and Measures Sensitivity, specificity, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the CNN’s diagnostic classification in unmarked, marked, and cropped images. Results In all, 130 melanocytic lesions (107 benign nevi and 23 melanomas) were imaged. In unmarked lesions, the CNN achieved a sensitivity of 95.7% (95% CI, 79%-99.2%) and a specificity of 84.1% (95% CI, 76.0%-89.8%). The ROC AUC was 0.969. In marked lesions, an increase in melanoma probability scores was observed that resulted in a sensitivity of 100% (95% CI, 85.7%-100%) and a significantly reduced specificity of 45.8% (95% CI, 36.7%-55.2%, P < .001). The ROC AUC was 0.922. 
Cropping images led to the highest sensitivity of 100% (95% CI, 85.7%-100%), specificity of 97.2% (95% CI, 92.1%-99.0%), and ROC AUC of 0.993. Heat maps created by vanilla gradient descent backpropagation indicated that the blue markings were associated with the increased false-positive rate. Conclusions and Relevance This study’s findings suggest that skin markings significantly interfered with the CNN’s correct diagnosis of nevi by increasing the melanoma probability scores and consequently the false-positive rate. A predominance of skin markings in melanoma training images may have induced the CNN’s association of markings with a melanoma diagnosis. Accordingly, these findings suggest that skin markings should be avoided in dermoscopic images intended for analysis by a CNN. Trial Registration German Clinical Trial Register (DRKS) Identifier: DRKS00013570
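The sensitivity and specificity figures quoted in this abstract can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the per-setting confusion counts are not stated in the abstract and are back-calculated here, as an assumption, from the reported percentages and the 23 melanomas and 107 benign nevi. The function and variable names are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of melanomas (positives) that were flagged as melanoma."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of benign nevi (negatives) that were correctly left unflagged."""
    return tn / (tn + fp)


# ASSUMPTION: counts back-calculated from the reported percentages
# (23 melanomas, 107 benign nevi); they are not given in the abstract.
counts = {
    "unmarked": {"tp": 22, "fn": 1, "tn": 90, "fp": 17},
    "marked": {"tp": 23, "fn": 0, "tn": 49, "fp": 58},
    "cropped": {"tp": 23, "fn": 0, "tn": 104, "fp": 3},
}


def summarize(all_counts):
    """Return {setting: (sensitivity %, specificity %)}, rounded to one decimal."""
    return {
        name: (
            round(100 * sensitivity(c["tp"], c["fn"]), 1),
            round(100 * specificity(c["tn"], c["fp"]), 1),
        )
        for name, c in all_counts.items()
    }


summary = summarize(counts)
for name, (sens, spec) in summary.items():
    print(f"{name:9s} sensitivity={sens}%  specificity={spec}%")
```

Under these assumed counts, the marked setting shows the pattern the abstract describes: every melanoma is still detected, but more than half of the benign nevi are now false positives.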

An Image Inpainting Technique Based on the Fast Marching Method

Article

Jan 2004

Alexandru Telea

Digital inpainting provides a means for reconstruction of small damaged portions of an image. Although the inpainting basics are straightforward, most inpainting techniques published in the literature are complex to understand and implement. We present here a new algorithm for digital inpainting based on the fast marching method for level set applications. Our algorithm is very simple to implement, fast, and produces nearly identical results to more complex, and usually slower, known methods. Source code is available online.
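To make the idea of filling a masked region from its boundary inward concrete, here is a minimal pure-Python sketch. It is not Telea's algorithm: the fast marching method orders pixels by distance to the boundary and uses weighted estimates, whereas this simplified "onion peel" stand-in fills one layer at a time with the plain average of the already-known 8-connected neighbors. The function names and the toy image are hypothetical.

```python
def neighbors(r, c, h, w):
    """Yield the in-bounds 8-connected neighbors of pixel (r, c)."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0) and 0 <= r + dr < h and 0 <= c + dc < w:
                yield r + dr, c + dc


def inpaint_onion_peel(image, mask):
    """Fill pixels where mask is True, working from the mask boundary inward.

    image: list of rows of floats; mask: list of rows of bools (True = fill).
    Simplified stand-in for fast-marching inpainting, not Telea's method.
    """
    h, w = len(image), len(image[0])
    out = [list(row) for row in image]
    unknown = {(r, c) for r in range(h) for c in range(w) if mask[r][c]}
    while unknown:
        layer = {}
        for r, c in unknown:
            # Only use neighbors that are original or already filled.
            known = [out[rr][cc] for rr, cc in neighbors(r, c, h, w)
                     if (rr, cc) not in unknown]
            if known:
                layer[(r, c)] = sum(known) / len(known)
        if not layer:
            raise ValueError("no known pixels to propagate from")
        # Commit the whole layer at once so the fill order is by layer.
        for (r, c), value in layer.items():
            out[r][c] = value
        unknown.difference_update(layer)
    return out


# Toy example: a flat 5x5 patch with a bright simulated marking in the middle.
image = [[7.0] * 5 for _ in range(5)]
image[2][2] = 255.0
mask = [[1 <= r <= 3 and 1 <= c <= 3 for c in range(5)] for r in range(5)]
restored = inpaint_onion_peel(image, mask)
```

In practice one would more likely call an existing implementation; OpenCV, for example, exposes this method through `cv2.inpaint` with the `cv2.INPAINT_TELEA` flag.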

Automated Classification of Skin Lesions: From Pixels to Practice

Article

Oct 2018
J INVEST DERMATOL

The letters "Interpretation of the Outputs of Deep Learning Model trained with Skin Cancer Dataset" and "Automated Dermatological Diagnosis: Hype or Reality?" highlight the opportunities, hurdles, and possible pitfalls with the development of tools that allow for automated skin lesion classification. The potential clinical impact of these advances relies on their scalability, accuracy, and generalizability across a range of diagnostic scenarios.

U-Net: Convolutional Networks for Biomedical Image Segmentation

Conference Paper

Oct 2015

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

Skin deep unlearning: Artefact and instrument debiasing in the context of melanoma classification

Jan 2021

P J Bevan
A Atapour-Abarghouei

Bevan, P.J., Atapour-Abarghouei, A.: Skin deep unlearning: Artefact and instrument debiasing in the context of melanoma classification. arXiv preprint arXiv:2109.09818 (2021)
