Available via license: CC BY 4.0
Removing confounding information from fetal
ultrasound images
Kamil Mikolaj1,2, Manxi Lin1, Zahra Bashir2, Morten Bo Søndergaard
Svendsen2, Martin Tolsgaard2, Anders Nymark1, and Aasa Feragen1
1DTU Compute, Technical University of Denmark
2CAMES Rigshospitalet
{kmik, afhar}@dtu.dk
Abstract. Confounding information in the form of text or markings em-
bedded in medical images can severely affect the training of diagnostic
deep learning algorithms. However, data collected for clinical purposes
often have such markings embedded in them. In dermatology, known ex-
amples include drawings or rulers that are overrepresented in images of
malignant lesions. In this paper, we encounter text and calipers placed
on the images found in national databases containing fetal screening ul-
trasound scans, which correlate with standard planes to be predicted. In
order to utilize the vast amounts of data available in these databases, we
develop and validate a series of methods for minimizing the confounding
effects of embedded text and calipers on deep learning algorithms de-
signed for ultrasound, using standard plane classification as a test case.
Keywords: Removing confounding information · Model Bias · Fetal Ultrasound ·
Standard Plane Classification.
1 Introduction
Clinical data is a great potential source of data for training medical imaging
models: It can come in vast amounts, and represents the nature of data quality
encountered in clinical practice. However, when data comes from the clinic, there
is little control of the data generating process, and the data may be affected
in ways that are suboptimal for training deep learning models. In particular,
clinical images sometimes come with embedded text, markings, calipers or other
annotations made by the clinician, and these likely carry information correlating
with the predictive task at hand. It has recently been shown that markings,
stickers and rulers present in dermatological images can confound predictors
that aim to diagnose skin lesions [6,12,7]. In this paper, we consider confounding
information present in fetal ultrasound images from clinical screening. As shown
in Fig. 1, these images often have text and calipers embedded in them, which can
affect predictors trained on the images. As a case study, we use standard plane
classification, which aims to automatically recognize those ultrasound planes
required for particular types of screening tests during pregnancy.
arXiv:2303.13918v1 [cs.CV] 24 Mar 2023
Fig. 1. Examples of clinical ultrasound images with text and calipers, i.e. annotated
coordinates for measuring anatomical objects, embedded into the image. Note that
both the text labels and the caliper geometry carry information about the particular
standard plane that the image contains.
Our contribution. First, we quantify the confounding effect of text and
calipers embedded in fetal ultrasound images used for training neural networks.
Next, we quantitatively assess the success of different methods that aim to re-
move these confounding effects, ranging from naïve to state of the art text in-
painting methods developed for natural images. We show that even simple meth-
ods that mask out the confounding information ensure improved generalization
to images that do not contain confounding information.
1.1 Related work
Recent work in dermatological imaging pointed out that pen markings or rulers,
which are often present in images of malignant skin lesions from clinical practice,
were confounding skin lesion diagnosis performed by a CNN approved as a
medical device [12,6]. More precisely, it was shown that the predictive models
performed better on images with rulers and markings than on those without.
This discovery spurred research into methods for removing this confounding ef-
fect, such as segmenting the lesion as a preprocessing step [5] to avoid looking
at context; inpainting stickers present on benign images to remove them from
the training images [7]; inserting prior knowledge into the models [1,8]; or ad-
versarially training the neural network to be unable to recognize whether the
confounding information is present [2].
In our setting, the confounding information is typically embedded into the
clinically relevant part of the image. As a result, we cannot apply methods that
remove context by segmenting out the object of interest. Instead, we focus on
validating both simple and more complex models for removing and inpainting
confounding text and calipers.
2 Methodology
We assess the confounding effects of embedded text and calipers using standard
plane classification for 3rd trimester growth scans as a test case. In order to assess
the weight of the fetus, as is commonly done in the 3rd trimester, the clinician
needs to obtain standard planes for the head, abdomen and femur. As a typical
application is to distinguish good standard planes from nonstandard planes that
cannot be used for standardized measurements, we also include images that are
not standard planes, which should ideally be classified as "Other".
We assess six different methods for removing confounding information. Ex-
amples of images inpainted with the different methods are found in Fig. 2.
2.1 Simple methods for removing confounding effects
The initial four methods consist of first detecting the text and calipers and then
replacing them in various ways.
The text and calipers embedded in the clinically relevant part of the image
are (in our training set) always yellow; these are detected via thresholding in hue,
saturation, value (HSV) space to segment yellow features. The resulting mask is
dilated with a 3x3 structuring element to enlarge the segmentation and connect
neighboring elements.
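The detection step can be sketched in a few lines of plain Python. This is only an illustration: the hue, saturation and value thresholds below are assumptions chosen to capture a saturated yellow, not the values used in our experiments, and the helper names `yellow_mask` and `dilate_3x3` are hypothetical.

```python
import colorsys


def yellow_mask(image, hue_range=(0.10, 0.20), min_sat=0.5, min_val=0.5):
    """Mark yellow pixels in an RGB image given as rows of (r, g, b) in 0..255.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            # colorsys expects and returns channels scaled to [0, 1].
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            is_yellow = hue_range[0] <= h <= hue_range[1] and s >= min_sat and v >= min_val
            mask_row.append(is_yellow)
        mask.append(mask_row)
    return mask


def dilate_3x3(mask):
    """Binary dilation with a 3x3 square structuring element."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Switch on the pixel itself and its eight neighbours.
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        out[ny][nx] = True
    return out
```

A single yellow pixel therefore grows into a 3x3 block after dilation, which is what connects neighbouring strokes of the same character or caliper.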
Additionally, as can be seen in Fig. 1, most images contain some gray and
blue text in the top right corner. To remove this, we first remove everything
around the conical ultrasound field of view. A mask is obtained by thresholding
the cone in HSV, finding the largest connected component and filling the holes,
after which everything else can be masked out. For cases where some blue and
grey text is on top of the field of view, the blue text is detected, and everything
above is replaced by a black box.
We next consider various approaches to inpainting the yellow masks.
Black box. In this first simple approach, the detected yellow mask is overlaid
by black boxes spanned by the minimum and maximum x- and y-coordinates
found within every connected component of the mask.
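A minimal sketch of the black box approach follows. It assumes a grayscale image stored as nested lists and 8-connectivity for the connected components; the connectivity is our assumption for illustration, and the function names are hypothetical.

```python
from collections import deque


def component_boxes(mask):
    """Return (x_min, y_min, x_max, y_max) for each 8-connected component."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search over one component, tracking its extent.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny in range(max(0, y - 1), min(height, y + 2)):
                    for nx in range(max(0, x - 1), min(width, x + 2)):
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


def black_box(image, mask):
    """Return a copy of the image with every component's bounding box set to 0."""
    out = [row[:] for row in image]
    for x_min, y_min, x_max, y_max in component_boxes(mask):
        for y in range(y_min, y_max + 1):
            for x in range(x_min, x_max + 1):
                out[y][x] = 0
    return out
```

Note that pixels inside a bounding box are blacked out even when they were not part of the mask, which is why the boxes remain clearly visible in the result.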
Replacing confounding information by noise. As the inpainted black boxes
leave clearly visible information on relative position, caliper geometry, etc., we
next replace the missing information with noise. For every connected component
of the mask, we find its minimal bounding box as above, and expand the
bounding box by 5 pixels in each direction. Next, the contents of the box are
replaced by noise as follows: The mean µ and standard deviation σ are computed
from the values of those bounding box pixels which are not segmented as
belonging to text/markings. Those pixels segmented as text/markings are then
replaced with noise sampled from a normal distribution N(µ, σ/10). The scaling
of the standard deviation was performed to reduce the visual effect of replacing
image pixels with noise; the scaling factor was selected from {1, 1/10, 1/100}
by optimizing validation set accuracy.
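The noise replacement for a single connected component can be sketched as below. The sketch assumes a grayscale image as nested lists of floats and takes the component's bounding box as an argument; using the population standard deviation and a fixed random seed are assumptions made for the illustration, and `fill_box_with_noise` is a hypothetical name.

```python
import random
import statistics


def fill_box_with_noise(image, mask, box, margin=5, scale=0.1, rng=None):
    """Replace masked pixels inside an expanded bounding box with Gaussian noise.

    image: grayscale rows of floats; mask: rows of bools (True = text/caliper).
    box: (x_min, y_min, x_max, y_max) of one connected component of the mask.
    scale: factor applied to the standard deviation (1/10 in the text above).
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    height, width = len(image), len(image[0])
    x_min, y_min, x_max, y_max = box
    # Expand the box by `margin` pixels in each direction, clipped to the image.
    x_lo, x_hi = max(0, x_min - margin), min(width - 1, x_max + margin)
    y_lo, y_hi = max(0, y_min - margin), min(height - 1, y_max + margin)
    coords = [(y, x) for y in range(y_lo, y_hi + 1) for x in range(x_lo, x_hi + 1)]
    out = [row[:] for row in image]
    # Statistics come only from pixels that are NOT text or markings.
    clean = [image[y][x] for y, x in coords if not mask[y][x]]
    if not clean:
        return out  # no context available, leave the box untouched
    mu = statistics.fmean(clean)
    sigma = statistics.pstdev(clean)
    for y, x in coords:
        if mask[y][x]:
            out[y][x] = rng.gauss(mu, sigma * scale)
    return out
```

With `scale=0.1` the sampled values stay close to the local mean, which matches the motivation given above of reducing the visual effect of the inserted noise.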
Fig. 2. Illustration of inpainting methods replacing both text and calipers on two
example image patches.
Bilinear interpolation. Third, we apply bilinear interpolation to inpaint the
mask.
Fast marching inpainting. Next, we apply a fast marching inpainting method
available in OpenCV [11] to inpaint the mask.
2.2 Deep learning methods for removing confounding effects
Next, we assess two different deep learning approaches that aim to generate
confounder-free images. Due to the memory cost, the images are divided into
300 × 240 pixel patches for model training and inference.
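One way to cover an image with fixed-size patches is sketched below. We assume here that 300 is the patch width and 240 the patch height, and that the last patch in each direction is shifted inwards so that it stays inside the image; both are assumptions for illustration, as is the handling of images smaller than one patch.

```python
def patch_starts(length, patch):
    """Start offsets of patches covering `length`; the last patch is shifted to fit."""
    if length <= patch:
        return [0]  # image smaller than a patch: caller would need to pad
    starts = list(range(0, length - patch, patch))
    starts.append(length - patch)
    return starts


def tile_image(width, height, patch_w=300, patch_h=240):
    """Return (x, y, patch_w, patch_h) tuples that together cover the image."""
    return [
        (x, y, patch_w, patch_h)
        for y in patch_starts(height, patch_h)
        for x in patch_starts(width, patch_w)
    ]
```

For example, a 960 × 720 image yields four column offsets and three row offsets, so twelve patches, with the last column overlapping its neighbour.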
U-Net. First, we train a U-net [9] to generate images without confounding
information. We regress the text-free image directly by designing the U-net so
that its input and output both have three channels. The pixel intensity of the
input images is rescaled to [−1,1], and the network output is activated by the
hyperbolic tangent function. The target labels are images inpainted by bilinear
interpolation. We use the SGD optimizer with a learning rate of 0.001 and mon-
itor the mean square root loss. The network is trained for 100 epochs with batch
size 8.
GAN for generating text-free images. As an alternative, we train a GAN [3],
which is a state-of-the-art model for scene text removal in natural images. We
follow the training settings from the original work. The training set is synthe-
sized by placing the markings detected by thresholding randomly on the images
inpainted by bilinear interpolation. This is a common way to construct datasets
in scene text removal.
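The construction of one synthetic training pair can be sketched as follows. The representation is an assumption made for the example: images are grayscale nested lists, and a marking is a small patch in which `None` denotes a transparent pixel. The name `synthesize_pair` is hypothetical.

```python
import random


def synthesize_pair(clean, marking, rng):
    """Paste a marking at a random position on a clean (inpainted) image.

    clean: grayscale rows, used unchanged as the training target.
    marking: rows where None is transparent and a number is a marking pixel.
    Returns (corrupted_input, target, mask_of_pasted_pixels).
    """
    height, width = len(clean), len(clean[0])
    m_height, m_width = len(marking), len(marking[0])
    # Choose a top-left corner such that the marking fits inside the image.
    top = rng.randint(0, height - m_height)
    left = rng.randint(0, width - m_width)
    corrupted = [row[:] for row in clean]
    mask = [[False] * width for _ in range(height)]
    for dy in range(m_height):
        for dx in range(m_width):
            value = marking[dy][dx]
            if value is not None:
                corrupted[top + dy][left + dx] = value
                mask[top + dy][left + dx] = True
    return corrupted, clean, mask
```

Calling this repeatedly with `random.Random(seed)` instances gives as many (input, target) pairs as needed from a limited pool of detected markings.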
2.3 Quantifying the effect of confounders
In order to quantify confounding effects, we train standard plane classification
models to classify standard planes for the growth scans typically performed
during the 3rd trimester. For such scans, one typically collects the standard
planes for head, abdomen, and femur.
We train our models first on raw images, and next on images where the con-
founding text and calipers are removed using the approaches listed above. Both
types of models are tested on an internal clinical dataset consisting of images
with confounders; an internal dataset consisting of images without confounders,
as well as an external dataset from a different country, where the images do not
contain confounders.
3 Experiments and Results
3.1 Standard plane classification models
We train the Efficientnet B3 architecture [10] for standard plane classification
using AdamW with a learning rate of 1e−4, with weighted cross entropy loss
to adjust the training to imbalanced data. No augmentation is applied. The
classifier is trained for at most 50 epochs, using early stopping with a patience
of 5 epochs. We used PyTorch 1.10 for all deep learning based models.
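Two of the training details above can be illustrated without any deep learning framework. The sketch makes two assumptions that the text does not specify: class weights are taken as inverse class frequencies, and early stopping monitors a validation loss. The function names are hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count); an assumed scheme."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count) for name, count in class_counts.items()}


def early_stopping_epochs(val_losses, max_epochs=50, patience=5):
    """Return (last_epoch_run, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without improvement,
    or after `max_epochs` epochs, whichever comes first. Epochs are 1-indexed.
    """
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch


# Class counts of the first training configuration (Train 1 in Table 1).
train_1_counts = {"Head": 669, "Abdomen": 786, "Femur": 601, "Other": 774}
weights = inverse_frequency_weights(train_1_counts)
```

Under this scheme the least frequent class (Femur) receives the largest weight, and the weighted class counts still sum to the size of the training set.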
3.2 Data
Internal database. We base our evaluation on 3rd trimester growth screening
images from the national fetal ultrasound screening programme from ANONY-
MOUS COUNTRY. The data was collected and used with permission from the
ANONYMISED. These images were annotated by an OB-Gyn resident as be-
ing either head, abdomen, femur or other, as these are the relevant standard
planes for growth estimation. Note, however, that as these images came from
the clinical screening database, from a trimester where screening scans are not
all made by expert sonographers, they were not all perfect standard planes. To
take this into account, the images were given a quality score from 0 (poor) to 10
(excellent), which is shown in Table 2.
Training data. We performed several experiments with different configura-
tions of the training/validation/test split, all configured with no subject overlap
between splits.
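A split without subject overlap can be obtained by assigning subjects, rather than images, to the splits. The sketch below is an illustration only: the split fractions, the record layout of (subject id, image id) pairs and the name `split_by_subject` are assumptions.

```python
import random


def split_by_subject(records, train_fraction=0.7, val_fraction=0.15, seed=0):
    """Split (subject_id, image_id) records so that no subject spans two splits.

    Subjects not assigned to train or val go to test.
    """
    subjects = sorted({subject for subject, _ in records})
    random.Random(seed).shuffle(subjects)
    n_train = round(train_fraction * len(subjects))
    n_val = round(val_fraction * len(subjects))
    assignment = {}
    for index, subject in enumerate(subjects):
        if index < n_train:
            assignment[subject] = "train"
        elif index < n_train + n_val:
            assignment[subject] = "val"
        else:
            assignment[subject] = "test"
    splits = {"train": [], "val": [], "test": []}
    # Every image follows its subject into that subject's split.
    for subject, image_id in records:
        splits[assignment[subject]].append((subject, image_id))
    return splits
```

Changing `seed` gives a different subject-disjoint resampling of the same records.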
We use two different training set configurations for our experiments. All
training data is selected from the internal database. The class demographics for
both configurations are found in Table 1.
In the first configuration, the training set contains only images with text and
calipers. In the second configuration, the training set contains both images with
(77%) and without (23%) text and calipers, sampled in such a way that we have
a similar number of images per class as in the first configuration.
Plane    Train 1  Val 1  Train 2  Val 2  Local Test          Local Test        External Test
                                         (with confounders)  (no confounders)  (no confounders)
Head     669      147    670      134    127                 121               3092
Abdomen  786      169    774      171    127                 119               711
Femur    601      118    598      121    127                 113               1040
Other    774      181    746      161    42                  38                4213
N        2830     615    2788     587    423                 391               9056
Table 1. Dataset demographics for the different experiments, detailing training,
validation and test splits from the local and external databases.
Test data. We test both on data from our own clinical screening database, and
on an external dataset from another country [4].
From the internal database we extracted two test sets: One with images
that contain text and calipers, and one with images that do not. The test sets
were designed to be identically distributed across the classes in order to get
comparable performance scores.
Images without text were automatically extracted from the database based
on HSV thresholding of the yellow color corresponding to the text. Since the
scanner’s model name is also yellow and is placed in the black area outside of the
ultrasound field of view, yellow areas that are surrounded by black background
are excluded. This is accomplished by morphological dilation of the given area
to obtain its neighbouring pixels; if the mean of the neighbours is equal to the
background, then it is excluded.
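The exclusion rule for yellow areas outside the field of view can be sketched as below. The sketch assumes a grayscale image, a background intensity of 0, and a single dilation step with a 3x3 structuring element to obtain the neighbouring pixels; the size of the structuring element is not stated above and is an assumption, as is the function name.

```python
def surrounded_by_background(image, component, background=0):
    """Check whether the pixels bordering a component are all background on average.

    image: grayscale rows; component: set of (y, x) pixels of one yellow area.
    The ring of neighbours is the 3x3 dilation of the component minus itself.
    """
    height, width = len(image), len(image[0])
    ring = set()
    for y, x in component:
        for ny in range(max(0, y - 1), min(height, y + 2)):
            for nx in range(max(0, x - 1), min(width, x + 2)):
                if (ny, nx) not in component:
                    ring.add((ny, nx))
    if not ring:
        return False  # component fills the whole image, nothing to compare
    mean = sum(image[y][x] for y, x in ring) / len(ring)
    return mean == background
```

A yellow area for which this returns True, such as the scanner's model name, would be ignored when deciding whether an image contains text.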
Since the text and calipers are placed by the clinician on those images that
are in the end used for predicting clinical outcomes, there is an expected drop
in quality for those images that do not contain text and calipers (see Table 2).
The external database contains ultrasound scans from 2nd and 3rd trimester,
classified into a range of different standard planes, whereas our internal data only
contains 3rd trimester. As we use the classes "head", "abdomen" and "femur",
the remaining images are given the class "other", which is likely differently dis-
tributed than the corresponding images from the internal test sets.
Plane    Internal data       Internal data
         (with confounders)  (no confounders)
Head     3.76 ± 1.74         2.92 ± 2.23
Abdomen  3.16 ± 1.63         2.46 ± 1.98
Femur    5.13 ± 1.95         4.46 ± 2.10
Table 2. The quality score of given standard planes used in internal data tests. Note
the lower quality of the ’no confounders’ data.
3.3 Assessing confounding effects
For each experiment, the training and validation sets were resampled 10 times.
Performance was compared to the baseline using t-tests for equality of means of
the accuracies, reporting p-values computed over the 10 repeated runs.
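As a worked illustration of such a comparison, the t statistic can be computed directly from summary statistics. The sketch assumes an equal-variance two-sample t-test and treats the reported spreads as sample standard deviations over 10 runs; the exact test variant is not stated above, so both are assumptions. The example numbers are the BlackBox and Baseline accuracies on internal data without confounders from Table 3.

```python
import math


def pooled_t_statistic(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, dof


# BlackBox (91.7 ± 1.4) versus Baseline (85.6 ± 4.1), 10 runs each.
t_value, dof = pooled_t_statistic(91.7, 1.4, 10, 85.6, 4.1, 10)
print(f"t = {t_value:.2f}, degrees of freedom = {dof}")
```

The p-value then follows from the t distribution with the given degrees of freedom; with the per-run accuracies at hand, a library routine such as `scipy.stats.ttest_ind` would compute statistic and p-value together.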
3.4 Experimental results
Results for the different training configurations for the standard plane classifi-
cation are found in Tables 3 and 4.
Method         Internal data       pval   Internal data     pval     External          pval
               (with confounders)         (no confounders)           (no confounders)
Baseline       97.0% ± 1.1%        -      85.6% ± 4.1%      -        80.5% ± 4.0%      -
BlackBox       96.3% ± 1.1%        0.17   91.7% ± 1.4%      2.8e-04  79.2% ± 2.5%      0.38
Noise          95.8% ± 1.6%        0.070  92.9% ± 1.1%      2.9e-05  80.6% ± 2.2%      0.97
FastMarching   96.7% ± 1.6%        0.67   93.8% ± 1.3%      1.0e-05  80.8% ± 2.6%      0.82
Interpolation  96.1% ± 1.5%        0.16   93.7% ± 1.2%      1.0e-05  80.3% ± 2.4%      0.91
GAN            95.9% ± 1.7%        0.10   92.7% ± 1.0%      4.1e-05  80.4% ± 3.1%      0.96
U-net          96.5% ± 1.3%        0.40   93.8% ± 0.9%      7e-06    79.4% ± 3.0%      0.49
Table 3. Classification results training only on images with confounders.
Note that while no method differs significantly from the baseline on the
internal data with confounders, all methods differ significantly from the
baseline on the internal data with no confounders.
Method         Internal data       pval   Internal data     pval     External          pval
               (with confounders)         (no confounders)           (no confounders)
Baseline       96.6% ± 1.0%        -      87.4% ± 2.9%      -        84.7% ± 3.2%      -
BlackBox       96.0% ± 1.0%        0.21   92.5% ± 1.9%      2.1e-04  80.9% ± 3.1%      0.013
Noise          95.9% ± 1.5%        0.23   93.5% ± 2.3%      6.9e-05  82.9% ± 3.1%      0.21
FastMarching   96.3% ± 1.3%        0.59   94.5% ± 1.8%      4.0e-06  83.1% ± 2.6%      0.22
Interpolation  96.6% ± 0.6%        1.0    94.6% ± 1.0%      1.0e-06  82.8% ± 1.8%      0.11
GAN            96.1% ± 1.4%        0.31   93.6% ± 1.5%      1.3e-05  83.0% ± 1.5%      0.13
U-net          96.9% ± 1.6%        0.70   94.7% ± 2.1%      5e-06    81.3% ± 2.1%      0.011
Table 4. Classification results training both on images with and without confounders.
4 Discussion and conclusion
We have shown that deep learning algorithms can be confounded when trained
on clinical ultrasound images with embedded text or calipers. We have compared
several methods for removing text and calipers, ranging from simple detection,
removal and classical inpainting or interpolation, to state-of-the-art deep learn-
ing models developed to remove text from natural images. All methods have a
positive effect by bringing classification performance on clean test images closer
to the performance on test images with embedded confounders, even though
several of them leave visible artefacts that carry spatial information about the
removed confounders. Moreover, the simple methods are performing on par with
or even better than the deep learning models. One reason might be that while
the deep learning models re-predict the entire image, the simple methods only
replace those parts of the image corrupted by text and calipers. Another draw-
back is that the neural networks are learning texture features for the whole image
simultaneously, and inpainting text and calipers with such generic textures may
be less beneficial than inpainting with locally inferred texture.
In terms of computational cost, at inference time the deep learning models
are about twice as fast as the classical methods. However, the deep learning
models also require training time. Moreover, as the classical methods run on
CPU, they could likely compete with the inference speed of the deep learning
models if they were also implemented on GPU.
We note that the performance on clean images from the internal dataset
is still slightly below the performance on images with embedded confounders.
This could be due to the generally lower quality of those images that do not have
embedded text and calipers. The images without text and calipers are those that
were not chosen as standard plane representatives by the clinician: the highest
quality images are the ones used for the clinical calculations and measurements.
Why is it so important to be able to train deep learning algorithms
on clinical quality images? Why don’t we, instead, perform our own data ac-
quisition obtaining images of the sought quality, but without embedded calipers
and text? For standard plane classification, this might be feasible, but would still
leave us with far less training data than national screening databases can pro-
vide. Even more importantly, however, national screening databases also come
with potential for linking with registries cataloguing patient outcomes. Such reg-
istries would allow us to train models to recognize rare anomalies and diseases,
which we would have no guarantee of finding represented in a smaller dataset
acquired for the task. In order to train such models, we need to be able to train
our networks robustly, without being affected by confounding information.
Why do we try to fix the data, when instead we could try to fix
the algorithm? Indeed, it is desirable to develop algorithms that are funda-
mentally robust to confounding information. Existing approaches to this problem
rely heavily on application specific prior knowledge, such as being able to seg-
ment the confounding information away from the image [1]. As, in our case, the
confounding information sits right on top of the most relevant part of the image,
these approaches do not carry over. By understanding how we might improve
training by debiasing our data, we believe we will be better equipped, down the
line, to develop algorithms that are inherently robust to confounders and bias.
Acknowledgements The work was funded in part by the Innovation
Fund Denmark for the project DIREC (9142-00001B); The Capital Region Re-
search Fund and The AI Signature Project, Danish Regions; and the Novo
Nordisk Foundation through the Center for Basic Machine Learning Research in
Life Science (NNF20OC0062606). The authors acknowledge the Pioneer Centre
for AI, DNRF grant P1.
References
1. Barnett, A.J., Schwartz, F.R., Tao, C., Chen, C., Ren, Y., Lo, J.Y., Rudin, C.: A
case-based interpretable deep learning model for classification of mass lesions in
digital mammography. Nature Machine Intelligence 3(12), 1061–1070 (2021)
2. Bevan, P.J., Atapour-Abarghouei, A.: Skin deep unlearning: Artefact and in-
strument debiasing in the context of melanoma classification. arXiv preprint
arXiv:2109.09818 (2021)
3. Bian, X., Wang, C., Quan, W., Ye, J., Zhang, X., Yan, D.M.: Scene text removal
via cascaded text stroke detection and erasing. Computational Visual Media 8(2),
273–287 (2022)
4. Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-
Carne, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional
neural networks for automatic classification of common maternal fetal ultrasound
planes. Scientific Reports 10(1), 1–12 (2020)
5. Maron, R.C., Hekler, A., Krieghoff-Henning, E., Schmitt, M., Schlager, J.G.,
Utikal, J.S., Brinker, T.J.: Reducing the impact of confounding factors on skin
cancer classification via image segmentation: technical model study. Journal of
Medical Internet Research 23(3), e21695 (2021)
6. Narla, A., Kuprel, B., Sarin, K., Novoa, R., Ko, J.: Automated classification of
skin lesions: from pixels to practice. Journal of Investigative Dermatology 138(10),
2108–2110 (2018)
7. Nauta, M., Walsh, R., Dubowski, A., Seifert, C.: Uncovering and correcting short-
cut learning in machine learning models for skin cancer diagnosis. Diagnostics
12(1), 40 (2022)
8. Rieger, L., Singh, C., Murdoch, W., Yu, B.: Interpretations are useful: penaliz-
ing explanations to align neural networks with prior knowledge. In: International
conference on machine learning. pp. 8116–8126. PMLR (2020)
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi-
cal image segmentation. In: International Conference on Medical image computing
and computer-assisted intervention. pp. 234–241. Springer (2015)
10. Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural
networks. In: International conference on machine learning. pp. 6105–6114. PMLR
(2019)
11. Telea, A.: An image inpainting technique based on the fast marching method.
Journal of Graphics Tools 9 (2004). https://doi.org/10.1080/10867651.2004.10487596
12. Winkler, J.K., Fink, C., Toberer, F., Enk, A., Deinlein, T., Hofmann-Wellenhof,
R., Thomas, L., Lallas, A., Blum, A., Stolz, W., et al.: Association between surgical
skin markings in dermoscopic images and diagnostic performance of a deep learn-
ing convolutional neural network for melanoma recognition. JAMA dermatology
155(10), 1135–1141 (2019)