ROBUSTNESS OF PROBABILISTIC U-NET FOR AUTOMATED SEGMENTATION OF WHITE MATTER HYPERINTENSITIES IN DIFFERENT DATASETS OF BRAIN MRI
Rizal Maulana
Faculty of Computer Science
Universitas Indonesia
Depok, Indonesia
rizal.maulana01@ui.ac.id
Muhammad Febrian Rachmadi
Brain Image Analysis Unit
RIKEN Center for Brain Science
Wako, Japan
Laksmita Rahadianti
Faculty of Computer Science
Universitas Indonesia
Depok, Indonesia
November 18, 2021
ABSTRACT
White Matter Hyperintensities (WMHs) are neuroradiological features often seen in T2-FLAIR brain MRI as white regions (i.e., hyperintensities) and characteristic of small vessel disease (SVD). Detailed measurements of WMHs (e.g., their volumes, locations, distributions) are vital for clinical research, but segmenting WMHs is challenging due to their ill-posed boundaries. In this study, we investigate the robustness of Probabilistic U-Net and other deterministic deep learning models (i.e., U-Net and its variations) for automatic segmentation of WMHs. In particular, we are interested in the robustness of U-Net based deep learning models, especially the Probabilistic U-Net, for segmenting WMHs in brain MRI from different datasets. Thus, we performed two different experiments: a k-fold cross validation experiment (i.e., training and testing using the same dataset) and a cross dataset experiment (i.e., testing on a different dataset). Based on our experiments, Probabilistic U-Net outperformed the other tested models in the k-fold cross validation experiment. On the other hand, we found that Probabilistic U-Net captured different types of uncertainty when tested on a different dataset.
Keywords White Matter Hyperintensities (WMHs) · segmentation of WMHs · probabilistic model · U-Net · Probabilistic U-Net · uncertainty · robustness
1 Introduction
White Matter Hyperintensities (WMHs) are neuroradiological features often seen in T2-FLAIR brain MRI and characteristic of small vessel disease (SVD) [1]. In T2-FLAIR brain MRI, WMHs appear as white regions (i.e., hyperintensities), which makes them easier to discern and differentiate from normal tissues of the brain, which usually appear in darker (i.e., grey) colours [2, 3]. Clinically, WMHs have been associated with neurodegenerative diseases such as Alzheimer's disease, stroke, dementia, and mood disorder [4].
Detailed measurements of WMHs (e.g., their volumes, locations, distributions) are required for finding the best treatment in clinical research [4], but manually segmenting and assessing WMHs for each patient is very expensive. Furthermore, segmentation of WMHs is challenging due to WMHs' ill-posed boundaries. Instead of a clear boundary between WMHs and non-WMHs regions, WMHs have gradual changes of intensity along their borders, commonly referred to as the "penumbra" of WMHs [5]. The penumbra of WMHs has been the subject of many studies which debate the criteria to correctly identify WMHs borders [6, 7]. In some cases, the penumbra of WMHs might appear very similar to MRI artefacts and non-WMHs regions [1]. Thus, manual assessment of WMHs is known to have low inter-rater reliability.
There have been many studies that propose automatic segmentation models for biomedical images using deep learning [8, 9, 10, 11, 12]. However, these studies mostly evaluated their models using test sets that come from the same dataset as the training set [4]. This treatment does not answer the question of how effective their proposed models
A PREPRINT - NOVEMBER 18, 2021
(a) U-Net [8] (b) Attention U-Net [10] (c) U-Net++ [13] (d) Attention U-Net++ [11] (e) Attention gate [10] (f) Illustration details
Figure 1: Illustrations of deterministic deep learning U-Net based models used in this study.
are on a different dataset. Robustness of automatic segmentation models across different datasets is of utmost importance in medical image analysis because test images/data can come from different hospitals, health care centers, or equipment manufacturers.
In this study, we investigate the robustness of U-Net based deep learning models for WMHs segmentation in two different datasets. Furthermore, we also investigate the robustness of two deep learning approaches, namely deterministic and probabilistic deep learning models, for WMHs segmentation. All code and trained models are available on our GitHub page (https://github.com/rizalmaulanaa/Robustness_of_Prob_U_Net).
2 Related Works
U-Net [8] has been widely used in many segmentation tasks, especially biomedical image segmentation, because it can work efficiently with limited training data [9]. There are three components in the original U-Net: encoder, decoder, and skip connections (see Fig. 1a). Skip connections have an important role in U-Net; they combine coarse-grained feature maps from each decoder with fine-grained feature maps from each encoder [13]. Many studies have tried to improve U-Net by adding new modules or redesigning the network itself [9, 10, 13, 14].
Attention U-Net was proposed in [10] (see Fig. 1b) by adding an attention gate to the U-Net. The purpose of the attention gate is to make the U-Net focus on places that have high relevance to the target labels [10]. The attention gate (see Fig. 1e) was first introduced in Natural Language Processing (NLP) and is now commonly used in Computer Vision [10, 11]. Fig. 1e shows the flow of the attention gate, which has two inputs that come from the skip connection and the up-sampling signal. The up-sampling signal is used to enrich information from the lower level, while the skip connection is used to retain information from the encoder. Lastly, the output of the attention gate is obtained by performing element-wise multiplication between the attention coefficient (α) and the skip connection [10].
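As an illustration, the attention-gate computation described above can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' code; the names (`attention_gate`, `w_x`, `w_g`, `psi`) are hypothetical, and the 1×1 convolutions are reduced to plain matrix multiplications over the channel axis.

```python
import numpy as np

def attention_gate(skip, gate, w_x, w_g, psi):
    """Additive attention gate (simplified sketch of Fig. 1e).

    skip : encoder feature map from the skip connection, shape (H, W, C)
    gate : up-sampled gating signal from the decoder,    shape (H, W, C)
    w_x, w_g : weights of the 1x1 convolutions, here plain (C, C_int) matrices
    psi  : (C_int, 1) weights producing the attention coefficient alpha
    """
    # additive attention: ReLU(W_x x + W_g g)
    q = np.maximum(skip @ w_x + gate @ w_g, 0.0)   # (H, W, C_int)
    # sigmoid squashes the result into an attention coefficient in (0, 1)
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))       # (H, W, 1)
    # element-wise multiplication of alpha and the skip connection
    return alpha * skip                            # (H, W, C)
```

Because alpha lies in (0, 1), the gate can only attenuate skip-connection features, never amplify them, which is how irrelevant regions get suppressed.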
U-Net++ was proposed in [13] by redesigning U-Net's skip connections. Instead of sending the semantic information from encoder to decoder directly, U-Net++ uses another set of convolutional blocks for extracting more features (see Fig. 1c). Thus, U-Net++ is basically a nested U-Net at different levels of semantic information. U-Net++ also employs Deep Supervision (DS) (shown as yellow lines in Fig. 1c), which averages all segmentation results from different branches of semantic information [13]. However, DS is optional, and the output segmentation is produced by the last decoder if DS is not used. U-Net++ can also be combined with the attention gate to become Attention U-Net++ [11] (see Fig. 1d).
Probabilistic U-Net [14] was first proposed for semantic segmentation of ambiguous images, such as in medical imaging. For example, different experts can produce different manual labels for one lung nodule in a CT scan [15]. Probabilistic U-Net employs a conditional variational autoencoder (CVAE) for obtaining a complex prior/posterior distribution to capture and model uncertainties from images. During inference, a random sample from the learned distribution is used to produce variations of the semantic segmentation.
The main difference between Probabilistic U-Net (probabilistic model) and the U-Net (deterministic model) is that Probabilistic U-Net has an additional process to learn a useful embedding in latent space for capturing variations of semantic segmentation (see Fig. 2). During training, Probabilistic U-Net's Posterior Net will learn to produce a latent space that
(a) Training process of Probabilistic U-Net [14] (b) Sampling process of Probabilistic U-Net [14]
Figure 2: Illustrations of (a) training process and (b) sampling process (i.e., inference after training process is finished)
of Probabilistic U-Net [14].
(a) Histogram of WMHs volumes in different datasets. (b) Histogram of WMHs volumes in different institutions. (c) Histogram of WMHs intensities in different datasets. (d) Histogram of WMHs intensities in different institutions.
Figure 3: Distributions of volumes of WMHs clusters in every slice (i.e., (a) and (b)) and distributions of WMHs' intensities (i.e., (c) and (d)) for each dataset. In (b) and (d), the Challenge dataset is divided into the institutions that make it up: Singapore, GE3T, and Utrecht.
can capture variations of segmentation from the ground truths and the medical image. On the other hand, Probabilistic U-Net's Prior Net will try to produce the same latent space using only the medical image. The Kullback-Leibler divergence is used to minimize differences between the posterior distribution (from the Posterior Net) and the prior distribution (from the Prior Net). In the sampling process (i.e., inference after training), the Prior Net is used to sample multiple latent vectors z, where each of them generates a variation of the semantic segmentation. Each sample z is broadcast to the same height and width as the U-Net's last feature maps, and then concatenated with those feature maps before being fed forward to the segmentation layer.
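The two operations described above, i.e., the KL term that aligns the Prior and Posterior Nets and the broadcast-and-concatenate step that injects a sample z into the U-Net, can be sketched as follows. This is a hedged NumPy sketch under our own naming (`kl_diag_gaussians`, `combine_with_latent`); it assumes the CVAE uses axis-aligned (diagonal) Gaussians, as in the original Probabilistic U-Net paper [14].

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians: the quantity
    minimised between the Posterior Net (q) and the Prior Net (p)."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def combine_with_latent(feature_maps, z):
    """Broadcast a latent sample z over the U-Net's last feature maps and
    concatenate along the channel axis, ready for the 1x1 segmentation layer.

    feature_maps : (H, W, C) last feature maps of the U-Net
    z            : (D,) latent sample drawn from the Prior Net
    """
    h, w, _ = feature_maps.shape
    z_map = np.broadcast_to(z, (h, w, z.shape[0]))         # tile z to H x W
    return np.concatenate([feature_maps, z_map], axis=-1)  # (H, W, C + D)
```

Drawing several z from the Prior Net and passing each through `combine_with_latent` is what yields the multiple segmentation variants used at inference time.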
Most previous studies produced their best performances when deep learning models were trained and tested using train and test sets from the same dataset [4]. To tackle the problem of robustness across different datasets, recent studies utilised ensembles [16] or multiple-branch [17] deep learning models. It has been indicated that the performance of deep learning models can be affected by the different intensity distributions of images obtained by different MRI scanners [17].
3 Methodology
3.1 Datasets and Pre-processing Methods
For testing the robustness of U-Net based models in different datasets, two different datasets from the Alzheimer's Disease Neuroimaging Initiative (ADNI)^1 [18] and the WMH Segmentation Challenge [19] were chosen. Note that each dataset has unique characteristics, e.g., different resolutions, numbers of slices, etc. In this study, we only used T2-FLAIR brain MRI scans from both datasets for training.
For the ADNI dataset, we used a subset of the ADNI dataset that has been used in many previous studies [20, 21, 22, 12, 7], which contains data from 20 patients where each patient has 3 MRI scans from different time points (the total is
^1 Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
60 MRI scans). All T2-FLAIR MRI scans have the same dimension of 256 × 256 × 35 pixels, where each voxel is 3.69 mm³. For more details on this dataset (e.g., data acquisition protocol parameters, ground truth creation, etc.), please see [21] and the data-share page^2.
For the WMH Segmentation Challenge^3 (hereinafter referred to as the Challenge dataset), it contains data from three different institutions (i.e., Singapore, GE3T, and Utrecht), where each institution has 20 patients (the total is 60 MRI scans). MRI scans from Singapore, GE3T, and Utrecht have dimensions of 232 × 256 × 48, 132 × 256 × 83, and 240 × 240 × 48 pixels, where each voxel is 3 mm³, 3.51 mm³, and 2.75 mm³, respectively. The Challenge dataset is used for testing the robustness of the tested models when segmenting WMHs in brain MRI from a different dataset. Data acquisition protocol parameters for the Challenge dataset can be found in [19].
To further highlight the differences between the datasets, we created histograms showing the distributions of volumes of WMHs clusters in every slice and the distributions of WMHs' intensities for each dataset in Fig. 3. We can see that the Challenge dataset has WMHs with higher intensities and larger WMHs clusters in every slice than the ADNI dataset (see Fig. 3c and Fig. 3a, respectively). Furthermore, if the Challenge dataset is divided into the institutions that make it up, we can see that each institution has its own distributions of WMHs intensity and volumes of WMHs clusters in every slice (see Fig. 3d and Fig. 3b, respectively). These kinds of differences have been found to affect the performance of deep learning models [21, 22].
All T2-FLAIR brain MRI scans were pre-processed and augmented before being used in the training process of the U-Net based models. Firstly, Bias Field Correction (BFC) [23] was used to correct low-frequency intensity inhomogeneity in the MRI scans. Secondly, a skull-stripping method, the Brain Extraction Tool (BET) [24], was used to extract brain tissues from the skull. Lastly, each slice from all MRI scans was normalized using Z-score normalization. MRI scans in the Challenge dataset from different institutions have different dimensions, so zero padding was applied to the edges of every slice before training. The final dimension is 240 × 256. For data augmentation, horizontal flip and rotation were used with probabilities of 0.5 and 0.8 respectively, using Albumentations [25].
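The last two pre-processing steps, i.e., Z-score normalization per slice and zero padding to the common 240 × 256 dimension, can be sketched as below. This is a hypothetical helper (`preprocess_slice` is our name, not from the paper's code); BFC and BET are external tools and are not reproduced here.

```python
import numpy as np

def preprocess_slice(mri_slice, target_hw=(240, 256)):
    """Z-score normalise one T2-FLAIR slice, then zero-pad it to target_hw."""
    # z-score normalization per slice (epsilon guards against empty slices)
    normed = (mri_slice - mri_slice.mean()) / (mri_slice.std() + 1e-8)
    # symmetric zero padding on the edges up to the target dimension
    pad_h = target_hw[0] - normed.shape[0]
    pad_w = target_hw[1] - normed.shape[1]
    return np.pad(normed, ((pad_h // 2, pad_h - pad_h // 2),
                           (pad_w // 2, pad_w - pad_w // 2)))
```

For a 232 × 256 slice from the Singapore sub-dataset, for instance, this pads 4 zero rows on each side to reach 240 × 256.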
Table 1: Evaluation on k-fold cross validation using the ADNI dataset (left) and the Challenge dataset (right) in Dice similarity coefficient (DSC), mean square error (MSE), and Bland-Altman plot criteria (i.e., mean volume error (MVE) and lower and upper limits of agreement (LoA)). A higher DSC is better, a lower MSE is better, and values closer to 0 are better for MVE and lower/upper LoA. The best result for each column is shown in bold and the second best is underlined.

Model | ADNI Dataset: DSC (std), MSE (std) [ml], MVE [ml], Lower / Upper LoA [ml] | Challenge Dataset: DSC (std), MSE (std) [ml], MVE [ml], Lower / Upper LoA [ml]
U-Net [8] 0.5332 (0.118) 11.6746 (14.465) -0.8423 -7.3873 / 5.7028 0.6150 (0.200) 80.9358 (142.906) -4.1343 -19.9273 / 11.6586
Attention U-Net [10] 0.5040 (0.120) 21.6698 (27.763) -1.6193 -10.2456 / 7.0070 0.6341 (0.183) 59.0076 (97.101) -4.0066 -16.9608 / 8.9476
U-Net++ [13] 0.5469 (0.123) 10.8643 (14.960) -0.4957 -6.9365 / 5.9451 0.6414 (0.179) 57.0659 (121.059) -2.6014 -16.6193 / 11.4166
Attention U-Net++ 0.5333 (0.143) 13.3631 (29.532) 0.6475 -6.4637 / 7.7586 0.6370 (0.181) 59.6467 (132.204) -2.5966 -16.9730 / 11.7799
U-Net++ w/ DS [13] 0.4479 (0.138) 30.2769 (36.275) -3.2570 -12.0229 / 5.5088 0.6302 (0.168) 65.4240 (131.189) -3.0973 -17.8660 / 11.6714
Attention U-Net++ w/ DS 0.4338 (0.167) 13.2083 (12.342) -2.1904 -7.9225 / 3.5417 0.5993 (0.207) 66.7030 (102.184) -4.3073 -18.0226 / 9.4080
Probabilistic U-Net [14] 0.5597 (0.115) 5.8978 (13.682) 0.0466 -4.7526 / 4.8458 0.6831 (0.164) 48.0402 (142.490) 0.9380 -12.6356 / 14.5116
3.2 Experimental Setup
All methods previously introduced in Section 2, namely U-Net (baseline model), Attention U-Net, U-Net++, Attention U-Net++, and Probabilistic U-Net, were used in this study. However, the Attention U-Net++ in this study is not the same as the Attention U-Net++ proposed in [11]: the down-sampling signals from the ensemble block to the decoder block (red arrows in Fig. 1d) are not used in this study's Attention U-Net++. In other words, this study's Attention U-Net++ is constructed by adding the attention gate only to the U-Net++.
For all experiments, we used the Adam optimizer with learning rate 0.001, a batch size of 16, Focal Loss (FL) [26] as the cost function, and 50 epochs for training each model. FL was introduced as a solution for extremely imbalanced data between foreground and background (e.g., 1:1000). Due to the nature of WMHs, FL is more effective than Cross Entropy (CE) for segmenting small clusters of WMHs. Note that, in this study, FL was chosen over CE based on preliminary experiments where FL outperformed CE. FL is defined as:

p_t = p if y = 1, and p_t = 1 − p otherwise   (1)

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)   (2)
^2 https://datashare.ed.ac.uk/handle/10283/2214
^3 https://wmh.isi.uu.nl/
where p is the predicted result and y is the ground truth. Parameters α and γ have values from 0 to 1 and 0 to 5, respectively. If γ = 0, then FL is equivalent to CE, whereas α is an array of weights used for balancing the data. Based on our preliminary experiments, we found that FL with γ = 1.0 and α = 0.5 performed best for the deterministic models, while γ = 0.25 and α = 0.5 performed best for the probabilistic model. For the Probabilistic U-Net, an additional Adam optimizer was used for optimizing the Prior Net and Posterior Net.
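For concreteness, the binary focal loss of Eqs. (1)–(2) can be sketched in NumPy as follows. This is our own sketch, not the training code; the function name and the eps guard are ours.

```python
import numpy as np

def focal_loss(p, y, alpha=0.5, gamma=1.0):
    """Binary focal loss, following Eq. (1) and Eq. (2).

    p     : predicted foreground probabilities (any shape)
    y     : binary ground truth (1 = WMH, 0 = background), same shape as p
    alpha : class-balancing weight
    gamma : focusing parameter; gamma = 0 reduces FL to cross entropy
    """
    eps = 1e-7                                  # numerical safety for log(0)
    p_t = np.where(y == 1, p, 1.0 - p)          # Eq. (1)
    fl = -alpha * (1.0 - p_t) ** gamma * np.log(p_t + eps)  # Eq. (2)
    return fl.mean()
```

With gamma = 0 and alpha = 1 this reduces to the usual binary cross entropy, which is one way to sanity-check an implementation; gamma > 0 down-weights easy (high p_t) pixels so that the small WMHs clusters dominate the loss.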
In this study, we performed two different experiments, namely the k-fold cross validation and cross dataset experiments. The k-fold cross validation experiment was performed to evaluate the performances of all tested models where the training and testing sets are from the same dataset. On the other hand, the cross dataset experiment was performed to evaluate the performances of all tested models on segmenting WMHs in T2-FLAIR brain MRI scans from a different dataset. In the k-fold cross validation experiment, we performed patient-level cross validation with k = 2 (i.e., 10 patients each are used for training and testing in each fold of the ADNI dataset, and 30 patients each for the Challenge dataset). In the cross dataset experiment, all T2-FLAIR brain MRI scans of one dataset are used for training while all T2-FLAIR brain MRI scans from the other dataset are used for testing.
3.3 Evaluation Measurements

To evaluate the performance of the models tested in this study, the Dice Similarity Coefficient (DSC), Mean Square Error (MSE), and Bland-Altman [27] criteria were used. DSC measures the spatial similarity between the ground truth and the predicted segmentation. MSE calculates the error between the true volume of WMHs and the predicted volume of WMHs. The Bland-Altman criteria and plot are used to evaluate the agreement/reliability in predicting the volumes of WMHs and are commonly used in clinical settings. The Bland-Altman criteria are the mean volumetric difference between the true volume of WMHs and the predicted volume of WMHs (hereinafter referred to as the mean volume error (MVE)) and the lower and upper limits of agreement (LoA). The lower/upper LoA can be calculated as MVE ± 1.96 × standard deviation (std) of the volume errors. For the Probabilistic U-Net, the final predicted segmentation of WMHs is an average of 30 variations of predicted segmentation (i.e., by sampling 30 different z from the Prior Net).
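The volumetric measures above can be computed directly from per-patient WMH volumes. Below is a minimal sketch (the helper name `volume_metrics` is ours) of MSE, MVE, and the lower/upper LoA following the MVE ± 1.96 × std formula:

```python
import numpy as np

def volume_metrics(true_vols, pred_vols):
    """MSE, MVE, and lower/upper limits of agreement for WMH volumes (ml).

    true_vols, pred_vols : per-patient true and predicted WMH volumes in ml
    """
    true_vols, pred_vols = np.asarray(true_vols), np.asarray(pred_vols)
    errors = pred_vols - true_vols               # per-patient volume error
    mse = np.mean(errors ** 2)                   # mean square error
    mve = errors.mean()                          # mean volume error
    spread = 1.96 * errors.std()                 # Bland-Altman half-width
    return mse, mve, mve - spread, mve + spread  # ..., lower LoA, upper LoA
```

If the errors are roughly Gaussian, about 95% of the per-patient volume errors are expected to fall between the lower and upper LoA, which is why points outside that band are flagged as outliers.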
Figure 4: Visualisations of WMHs ground truth (GT) and predicted WMHs segmentation by the tested models from the k-fold cross validation experiment (above) and the cross dataset experiment (below) after binarisation. Red regions are true/predicted WMHs. The volume of WMHs in the particular slice and the Dice Similarity Coefficient (DSC) are written at the bottom left.
4 Results and Discussions
4.1 K-Fold Cross Validation Experiment
Table 1 shows the results of the k-fold cross validation experiment for both datasets used in this study (i.e., the ADNI dataset (left) and the Challenge dataset (right)). From the table, it is clear that Probabilistic U-Net produced the best results in the DSC, MSE, and Bland-Altman criteria in both datasets. However, the Probabilistic U-Net did not perform best on the upper LoA measurement.

Also based on Table 1, adding an attention gate did not improve the performances of U-Net, U-Net++, and U-Net++ with DS in general. However, it is worth mentioning that the attention gate improved the performance of U-Net in the DSC and MSE measurements on the Challenge dataset. In most other cases, adding the attention gate does not yield a higher DSC than the baseline model (e.g., U-Net++ to
Figure 5: Bland-Altman plots produced by U-Net, U-Net++, and Probabilistic U-Net in the k-fold cross validation experiment tested on the ADNI dataset (left) and the Challenge dataset (right). Yellow lines are the upper limit of agreement (LoA), green lines are the mean volume error (MVE), and red lines are the lower LoA. The closer the lines are to the value 0 on the y-axis, the better.
Attention U-Net++). Furthermore, the DS module did not improve the performances of the U-Net++ and Attention U-Net++ models in almost all evaluation measurements. These findings indicate that the attention gate and DS module are not very effective for automatic WMHs segmentation.

Qualitative/visual assessment can be seen in Fig. 4 (the upper side is for the k-fold cross validation experiment while the lower side is for the cross dataset experiment), where the DSC measurements produced by the tested models are also shown. The Bland-Altman criteria for U-Net, U-Net++, and Probabilistic U-Net in both datasets listed in Table 1 are plotted in Fig. 5. The green line is the MVE, the red line is the lower LoA, and the yellow line is the upper LoA; each dot represents one patient. The closer the lines are to the value 0 on the y-axis, the better. Thus, we can see that the Probabilistic U-Net is better than the U-Net and the U-Net++ at estimating the volumes of WMHs in the k-fold cross validation experiment in both datasets. However, it is worth mentioning that there are still some outliers in the estimation (i.e., outside the interval of the lower and upper LoA).
Table 2: Evaluation on the cross dataset experiment using the ADNI dataset for training and the Challenge dataset for testing in Dice similarity coefficient (DSC), mean square error (MSE), and the Bland-Altman plot criterion mean volume error (MVE). A higher DSC is better, a lower MSE is better, and values closer to 0 are better for MVE. The best result for each column is shown in bold and the second best is underlined.

Model | Singapore DSC (std) | GE3T DSC (std) | Utrecht DSC (std) | Average DSC (std) | MSE (std) [ml] | MVE (std) [ml]
U-Net [8] 0.6459 (0.194) 0.6368 (0.128) 0.5964 (0.208) 0.6264 (0.178) 57.6554 (40.698) 0.1454 (7.140)
Attention U-Net [10] 0.6567 (0.185) 0.6283 (0.123) 0.5798 (0.210) 0.6216 (0.176) 64.6727 (43.113) -1.2613 (7.065)
U-Net++ [13] 0.6584 (0.183) 0.6159 (0.120) 0.5876 (0.208) 0.6206 (0.174) 74.8881 (44.480) -1.7248 (7.202)
Attention U-Net++ 0.6653 (0.178) 0.6551 (0.092) 0.5592 (0.207) 0.6265 (0.170) 85.5528 (52.614) -2.8858 (7.607)
U-Net++ w/ DS [13] 0.6520 (0.171) 0.6508 (0.101) 0.5637 (0.224) 0.6222 (0.175) 79.7682 (65.307) -0.4982 (7.636)
Attention U-Net++ w/ DS 0.6273 (0.199) 0.5596 (0.136) 0.5467 (0.224) 0.5779 (0.190) 82.9248 (39.542) -2.9455 (7.479)
Probabilistic U-Net [14] 0.6430 (0.170) 0.6934 (0.086) 0.5641 (0.200) 0.6335 (0.166) 114.5996 (107.905) 4.2042 (8.486)
Table 3: Evaluation on the cross dataset experiment using the Challenge dataset for training and the ADNI dataset for testing in Dice similarity coefficient (DSC), mean square error (MSE), and Bland-Altman plot criteria (i.e., mean volume error (MVE) and lower/upper limits of agreement (LoA)). A higher DSC is better, a lower MSE is better, and values closer to 0 are better for MVE and lower/upper LoA. The best result for each column is shown in bold and the second best is underlined.

Model | DSC (std) | MSE (std) [ml] | MVE [ml] | Lower / Upper LoA [ml]
U-Net [8] 0.5346 (0.164) 16.9374 (34.075) 2.1448 -4.7976 / 9.0873
Attention U-Net [10] 0.4999 (0.156) 24.4821 (56.104) 0.7700 -8.8906 / 10.4307
U-Net++ [13] 0.5285 (0.158) 17.1473 (30.529) 1.2444 -6.5620 / 9.0508
Attention U-Net++ 0.5021 (0.164) 20.3446 (39.623) 1.7858 -6.4009 / 9.9725
U-Net++ w/ DS [13] 0.4616 (0.179) 21.9043 (37.378) 1.7875 -6.7618 / 10.3369
Attention U-Net++ w/ DS 0.4885 (0.186) 18.7165 (35.286) 2.1043 -5.3670 / 9.5756
Probabilistic U-Net [14] 0.4809 (0.187) 25.1539 (44.206) 2.7703 -5.4933 / 11.0339
4.2 Cross Dataset Experiment
Table 2 shows the results of the cross dataset experiment where the ADNI dataset was used for training and the Challenge dataset for testing, whereas Table 3 shows the results where the Challenge dataset was used for training and the ADNI dataset for testing.
Figure 6: Ambiguity maps of the same slice from the k-fold cross validation experiment (middle) and the cross dataset experiment (right), calculated by using the Cross Entropy (CE) between the mean sample and all samples (γ(s) = E[CE(s̄, s)]).
From Table 2, we can see that the Probabilistic U-Net performed best in the average DSC and the DSC for the GE3T sub-dataset. However, the Probabilistic U-Net failed to produce the best results for the other evaluation measurements. The best results for each evaluation measurement were produced by different models; specifically, the original U-Net produced the best results for most of them, namely the DSC for the Utrecht sub-dataset, MSE, and MVE.

From Table 3, we can see that the Probabilistic U-Net failed to produce the best result in any evaluation measurement. Instead, the original U-Net produced the best results in almost all evaluation measurements except the MSE. It is also worth mentioning that U-Net++ is the second best performer in almost all evaluation measurements.
4.3 The Performance of Probabilistic U-Net
From the cross dataset experiment, we found that the Probabilistic U-Net performed worse than the other models, especially when data from different institutions are put together (i.e., the Challenge dataset) and used for training. We hypothesise that the Probabilistic U-Net captures different uncertainties/ambiguities when trained in the k-fold cross validation and cross dataset experiments. To verify this, we created ambiguity maps for each experiment (Fig. 6). An ambiguity map can be created by generating variations of the predicted segmentation using the Probabilistic U-Net and then calculating the CE between the mean predicted segmentation and all variations of the predicted segmentation (see Fig. 6).

From the second column of Fig. 6, we can see that the ambiguity map produced by the Probabilistic U-Net in the k-fold cross validation experiment has high uncertainties in normal tissues of the brain that appear like, or have similar textures to, the WMHs. In contrast, from the third column of Fig. 6, we can see that the ambiguity map produced by the Probabilistic U-Net in the cross dataset experiment has high uncertainties around the borders of WMHs. This shows that the Probabilistic U-Net captured the ambiguity of WMHs' manual labels produced by different raters, as in the original study of Probabilistic U-Net [14], in the cross dataset experiment. We believe this happened due to the different distributions of WMHs intensity and volumes of WMHs clusters in every slice between the ADNI dataset and the Challenge dataset, as seen in Fig. 3. Also, note that each institution in the Challenge dataset has different distributions as well, which creates a lot of uncertainty in the dataset.
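The ambiguity map γ(s) = E[CE(s̄, s)] described above can be sketched as follows. This is our own NumPy sketch (the function name and the eps guard are ours): it averages the binary cross entropy between the mean predicted segmentation and each sampled segmentation.

```python
import numpy as np

def ambiguity_map(samples, eps=1e-7):
    """Per-pixel ambiguity gamma(s) = E[CE(s_bar, s)] over N samples.

    samples : sampled WMH probability maps, shape (N, H, W)
    returns : ambiguity map, shape (H, W)
    """
    s_bar = samples.mean(axis=0)                # mean predicted segmentation
    # binary cross entropy between the mean map and every sample
    ce = -(s_bar * np.log(samples + eps)
           + (1.0 - s_bar) * np.log(1.0 - samples + eps))
    return ce.mean(axis=0)                      # expectation over samples
```

Pixels where the sampled segmentations disagree get high CE against their own mean, which is exactly where the maps in Fig. 6 light up, while pixels that all samples confidently agree on stay near zero.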
5 Conclusion and Future Work
In this study, we investigated the robustness of different U-Net based deep learning models for automatic segmentation of WMHs in different datasets of brain MRI. We also investigated the robustness of Probabilistic U-Net (i.e., a probabilistic model) and compared its performance to the original U-Net, Attention U-Net, U-Net++, Attention U-Net++, and their variants (i.e., deterministic models). It is worth mentioning that all models were tested using their best hyper-parameters found in the preliminary experiments.

Based on the k-fold cross validation experiment, Probabilistic U-Net outperformed all other tested models in all evaluation measurements (i.e., DSC, MSE, and Bland-Altman criteria/plot). However, we found that Probabilistic U-Net was outperformed by the original U-Net in the cross dataset experiment in some evaluation measurements, especially when the Challenge dataset was used for training.

Based on the ambiguity maps produced by the Probabilistic U-Net, we found that it captures different types of uncertainty in different experiments. In the k-fold cross validation experiment, uncertainties between WMHs and non-WMHs regions were captured by the Probabilistic U-Net. On the other hand, uncertainties concentrated around the borders of WMHs were captured in the cross dataset experiment. Thus, in the future, we would like to find a way to improve the robustness of Probabilistic U-Net across different datasets.
Acknowledgment
We gratefully acknowledge the support from Tokopedia-UI AI Center, Faculty of Computer Science, University of
Indonesia, for the NVIDIA DGX-1 that we used for running the experiments. MFR is with the Special Postdoctoral
Researchers Program, RIKEN.
Data collection and sharing for this project was partially funded by the Alzheimer’s Disease Neuroimaging Initiative
(ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number
W81XWH-12-2-0012). The grantee organization is the Northern California Institute for Research and Education, and
the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California.
ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
References
[1]
Joanna M. Wardlaw, Francesca M. Chappell, Maria del Carmen Valdés Hernández, Stephen D.J. Makin, Julie
Staals, Kirsten Shuler, Michael J. Thrippleton, Paul A. Armitage, Susana Muñoz-Maniega, Anna K. Heye, Eleni
Sakka, and Martin S. Dennis. White matter hyperintensity reduction and outcomes after minor stroke. Neurology,
89(10):1003–1010, 2017.
[2]
Muhammad Febrian Rachmadi, Maria del C. Valdés-Hernández, Stephen Makin, Joanna Wardlaw, and Taku
Komura. Automatic spatial estimation of white matter hyperintensities evolution in brain mri using disease
evolution predictor deep neural networks. Medical Image Analysis, 63:101712, 2020.
[3]
Karen Misquitta, Mahsa Dadar, D. Louis Collins, and Maria Carmela Tartaglia. White matter hyperintensities
and neuropsychiatric symptoms in mild cognitive impairment and alzheimer’s disease. NeuroImage: Clinical,
28:102367, 2020.
[4]
Ramya Balakrishnan, Maria del C. Valdés Hernández, and Andrew J. Farrall. Automatic segmentation of white
matter hyperintensities from brain magnetic resonance images in the era of deep learning and big data a
systematic review. Computerized Medical Imaging and Graphics, 88:101867, 2021.
[5]
Pauline Maillard, Evan Fletcher, Danielle Harvey, Owen Carmichael, Bruce Reed, Dan Mungas, and Charles
DeCarli. White matter hyperintensity penumbra. Stroke, 42(7):1917–1922, 2011.
[6]
Maria del C Valdés Hernández, Karen J Ferguson, Francesca M Chappell, and Joanna M Wardlaw. New
multispectral mri data fusion technique for white matter lesion segmentation: method and comparison with
thresholding in flair images. European radiology, 20(7):1684–1691, 2010.
[7]
Muhammad Febrian Rachmadi, Maria del C Valdés-Hernández, Hongwei Li, Ricardo Guerrero, Rozanna Mei-
jboom, Stewart Wiseman, Adam Waldman, Jianguo Zhang, Daniel Rueckert, Joanna Wardlaw, et al. Limited
one-time sampling irregularity map (lots-im) for automatic unsupervised assessment of white matter hyperintensi-
ties and multiple sclerosis lesions in structural brain magnetic resonance images. Computerized Medical Imaging
and Graphics, 79:101685, 2020.
[8]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image
segmentation. CoRR, abs/1505.04597, 2015.
[9]
Jiong Wu, Yue Zhang, Kai Wang, and Xiaoying Tang. Skip connection u-net for white matter hyperintensities
segmentation from mri. IEEE Access, 7:155194–155202, 2019.
[10]
Ozan Oktay, Jo Schlemper, Loïc Le Folgoc, Matthew C. H. Lee, Mattias P. Heinrich, Kazunari Misawa, Kensaku
Mori, Steven G. McDonagh, Nils Y. Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention
u-net: Learning where to look for the pancreas. CoRR, abs/1804.03999, 2018.
[11] Chen Li, Yusong Tan, Wei Chen, Xin Luo, Yuanming Gao, Xiaogang Jia, and Zhiying Wang. Attention UNet++: A nested attention-aware U-Net for liver CT image segmentation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 345–349, 2020.
[12] Yunhee Jeong, Muhammad Febrian Rachmadi, Maria del C. Valdés-Hernández, and Taku Komura. Dilated Saliency U-Net for white matter hyperintensities segmentation using irregularity age map. Frontiers in Aging Neuroscience, 11:150, 2019.
[13] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging, 2019.
[14] Simon A. A. Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus H. Maier-Hein, S. M. Eslami, Danilo Jimenez Rezende, and Olaf Ronneberger. A probabilistic U-Net for segmentation of ambiguous images. arXiv preprint arXiv:1806.05034, 2018.
[15] Samuel G. Armato III, Geoffrey McLennan, Luc Bidaut, Michael F. McNitt-Gray, Charles R. Meyer, Anthony P. Reeves, Binsheng Zhao, Denise R. Aberle, Claudia I. Henschke, Eric A. Hoffman, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915–931, 2011.
[16] Konstantinos Kamnitsas, Wenjia Bai, Enzo Ferrante, Steven G. McDonagh, Matthew Sinclair, Nick Pawlowski, Martin Rajchl, Matthew C. H. Lee, Bernhard Kainz, Daniel Rueckert, and Ben Glocker. Ensembles of multiple models and architectures for robust brain tumour segmentation. CoRR, abs/1711.01468, 2017.
[17] Naoya Furuhashi, Shiho Okuhata, and Tetsuo Kobayashi. A robust and accurate deep-learning-based method for the segmentation of subcortical brain: Cross-dataset evaluation of generalization performance. Magnetic Resonance in Medical Sciences, 20(2):166–174, 2021.
[18] Susanne G. Mueller, Michael W. Weiner, Leon J. Thal, Ronald C. Petersen, Clifford Jack, William Jagust, John Q. Trojanowski, Arthur W. Toga, and Laurel Beckett. The Alzheimer's Disease Neuroimaging Initiative. Neuroimaging Clinics of North America, 15(4):869, 2005.
[19] Hugo J. Kuijf, J. Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen, Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev, M. Jorge Cardoso, Adrià Casamitjana, D. Louis Collins, Mahsa Dadar, Achilleas Georgiou, Mohsen Ghafoorian, Dakai Jin, April Khademi, Jesse Knight, Hongwei Li, Xavier Lladó, Miguel Luna, Qaiser Mahmood, Richard McKinley, Alireza Mehrtash, Sébastien Ourselin, Bo-Yong Park, Hyunjin Park, Sang Hyun Park, Simon Pezold, Elodie Puybareau, Leticia Rittner, Carole H. Sudre, Sergi Valverde, Verónica Vilaplana, Roland Wiest, Yongchao Xu, Ziyue Xu, Guodong Zeng, Jianguo Zhang, Guoyan Zheng, Christopher Chen, Wiesje van der Flier, Frederik Barkhof, Max A. Viergever, and Geert Jan Biessels. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the WMH Segmentation Challenge. IEEE Transactions on Medical Imaging, 38(11):2556–2568, 2019.
[20] Maria del C. Valdés-Hernández. Reference segmentations of white matter hyperintensities from a subset of 20 subjects scanned three consecutive years, Dec 2016.
[21] Muhammad Febrian Rachmadi, Maria del C. Valdés-Hernández, Maria Leonora Fatimah Agan, and Taku Komura. Deep learning vs. conventional machine learning: Pilot study of WMH segmentation in brain MRI with absence or mild vascular pathology. Journal of Imaging, 3(4):66, 2017.
[22] Muhammad Febrian Rachmadi, Maria del C. Valdés-Hernández, Maria Leonora Fatimah Agan, Carol Di Perri, Taku Komura, Alzheimer's Disease Neuroimaging Initiative, et al. Segmentation of white matter hyperintensities using convolutional neural networks with global spatial information in routine clinical brain MRI with none or mild vascular pathology. Computerized Medical Imaging and Graphics, 66:28–43, 2018.
[23] J. G. Sled, A. P. Zijdenbos, and A. C. Evans. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Transactions on Medical Imaging, 17(1):87–97, 1998.
[24] Mark Jenkinson, Mickael Pechaud, Stephen Smith, et al. BET2: MR-based estimation of brain, skull and scalp surfaces. In Eleventh Annual Meeting of the Organization for Human Brain Mapping, volume 17, page 167, Toronto, 2005.
[25] Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. Albumentations: Fast and flexible image augmentations. Information, 11(2), 2020.
[26] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2018.
[27] J. Martin Bland and Douglas G. Altman. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476):307–310, 1986. Originally published as Volume 1, Issue 8476.