Brain Tumor Segmentation Based on the Fusion of Deep Semantics and
Edge Information in Multimodal MRI
Zhiqin Zhu^a, Xianyu He^a, Guanqiu Qi^b, Yuanyuan Li^a, Baisen Cong^c, Yu Liu^d,∗
^a College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
^b Computer Information Systems Department, State University of New York at Buffalo State, Buffalo, NY 14222, USA
^c Diagnostics Digital, DH(Shanghai) Diagnostics Co, Ltd, a Danaher company, Shanghai, 200335, China
^d Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China
Abstract
Brain tumor segmentation in multimodal MRI has great significance in clinical diagnosis and treatment.
The utilization of multimodal information plays a crucial role in brain tumor segmentation. However, most
existing methods focus on the extraction and selection of deep semantic features, while ignoring some features
with specific meaning and importance to the segmentation problem. In this paper, we propose a brain tumor
segmentation method based on the fusion of deep semantics and edge information in multimodal MRI, aiming
to achieve a more sufficient utilization of multimodal information for accurate segmentation. The proposed
method mainly consists of a semantic segmentation module, an edge detection module and a feature fusion
module. In the semantic segmentation module, the Swin Transformer is adopted to extract semantic features
and a shifted patch tokenization strategy is introduced for better training. The edge detection module is
designed based on convolutional neural networks (CNNs) and an edge spatial attention block (ESAB) is
presented for feature enhancement. The feature fusion module aims to fuse the extracted semantic and edge
features, and we design a multi-feature inference block (MFIB) based on graph convolution to perform feature
reasoning and information dissemination for effective feature fusion. The proposed method is validated on
the popular BraTS benchmarks. The experimental results verify that the proposed method outperforms
a number of state-of-the-art brain tumor segmentation methods. The source code of the proposed
method is available at https://github.com/HXY-99/brats.
Keywords: Brain tumor segmentation, Transformer, convolutional neural networks, edge feature, feature
fusion
1. Introduction
Medical image segmentation is an important topic in the medical image processing community. Within this
field, brain tumor segmentation aims to localize multiple types of tumor regions in images, which is of
great significance to clinical practice [1]. Owing to its capacity to provide high-resolution anatomical
structures of soft tissues, magnetic resonance imaging (MRI) is widely used in the diagnosis and treatment
of brain tumors. To obtain comprehensive information for accurate segmentation, multimodal
MRI scans with different imaging parameters are usually required in brain tumor segmentation. Commonly
used modalities include fluid-attenuated inversion recovery (FLAIR), T1-weighted (T1), contrast-enhanced
T1-weighted (T1ce) and T2-weighted (T2). Images of different modalities capture different pathological
information and complement each other effectively [2], which plays a crucial role in segmenting
multiple types of brain tumor regions such as edema (ED), necrosis and non-enhancing tumor (NCR/NET),
and enhancing tumor (ET). An example of multimodal MRI for brain tumor segmentation is shown in Fig.
1. For simplicity, only one slice is selected from the entire scan. Fig. 1(a) shows the ground truth (GT)
segmentation label provided by domain experts. The green, yellow and red colors indicate ED, ET and NCR/NET
regions, respectively. From Fig. 1(b)-(e), it can be seen that the characteristics of different modalities vary
significantly. For example, the FLAIR modality captures the ED regions well, with distinct edges or
boundaries between the tumor and normal tissues, while the T1ce modality is more effective in detecting
the tumor core (i.e., the union of ET and NCR/NET) with high contrast [3].
∗Corresponding author
Email addresses: zhuzq@cqupt.edu.cn (Zhiqin Zhu), s210301012@stu.cqupt.edu.cn (Xianyu He), qig@buffalostate.edu
(Guanqiu Qi), liyy@cqupt.edu.cn (Yuanyuan Li), bcong@dhdiagnostics.com (Baisen Cong), yuliu@hfut.edu.cn (Yu Liu)
Preprint submitted to Information Fusion, October 2, 2022
Figure 1: An example of multimodal MRI for brain tumor segmentation. (a) The ground truth (GT) segmentation label
provided by domain experts (the green, yellow and red represent edema (ED), enhancing tumor (ET), and necrosis and non-
enhancing tumor (NCR/NET), respectively). (b) The FLAIR modality. (c) The T1 modality. (d) The T1ce modality. (e) The
T2 modality.
A variety of brain tumor segmentation approaches have been presented in the literature. In recent years,
deep learning-based methods have become the mainstream in this field [4]. The most popular approach is
to adopt semantic segmentation-oriented convolutional neural networks (CNNs) such as fully convolutional
networks (FCNs) [5], U-Net [6] and V-Net [7] to segment brain tumors. CNN-based models capture local
features well in 2D or 3D spaces. However, CNNs are limited by the receptive field of convolutions,
which makes it difficult to characterize the global dependencies of features, an important
clue in semantic segmentation. To address this issue, Transformer-based models [8, 9] have been introduced
into the study of brain tumor segmentation [10]. By establishing connections between basic feature
units (i.e., tokens) and adopting a self-attention mechanism, Transformer-based models demonstrate
better capacity in capturing global contextual information. However, the input size of the standard vision
Transformer (ViT) [9] is fixed, which causes excessive computational cost for semantic
segmentation, a task that requires pixel-wise dense prediction. By constructing a hierarchical structure to obtain
feature maps like CNNs, the Swin Transformer [11] alleviates this problem and achieves efficient dense
prediction, which gives it clear advantages for semantic segmentation problems. Nevertheless, the Swin
Transformer still suffers from a weak locality inductive bias, which means that it usually requires
a very large amount of training data to obtain satisfactory visual representations [12]. Since the datasets
in most medical image analysis problems are typically very small, a pre-trained model is generally needed
when utilizing ViT and its variants, including the Swin Transformer. However, an appropriate pre-trained model
is not always available in practice, and performing pre-training oneself is not an easy task.
In most existing brain tumor segmentation methods, the multimodal MRI scans are simply stacked
as the model input for semantic segmentation, which may lead to insufficient utilization of multimodal
information [13, 14]. In fact, some modalities tend to play a more significant role in the segmentation task,
as they contain more distinctive information. For instance, the FLAIR and T1ce modalities can effectively
capture the edges of multiple types of tumor regions such as ED and ET. Edge information is very
important for brain tumor segmentation, as it not only helps to localize the tumors better,
but also benefits the boundary quality (e.g., sharpness and accuracy) of segmented regions [15, 16].
Edge information can therefore be a good complement to the deep semantic information. It
is worth noting that although some methods [2, 17, 18] consider the complementary features of multimodal
inputs and explore multimodal fusion accordingly, they mainly focus on the extraction and fusion of deep
semantic features without considering the importance of specific edge features for segmentation.
In this paper, we propose a brain tumor segmentation method based on the fusion of deep semantics
and edge information in multimodal MRI, aiming to make fuller use of multimodal
information for accurate segmentation. Specifically, the proposed segmentation framework consists of three
main modules: semantic segmentation, edge detection and feature fusion. The semantic segmentation module
adopts the Swin Transformer as the backbone due to its advantages mentioned above. Moreover, we
introduce a shifted patch tokenization strategy [12] into the Swin Transformer to increase its locality
inductive bias, so as to make training on small datasets easier. The edge detection module is designed
based on CNNs to extract edge features from the FLAIR and T1ce modalities according to their
characteristics. The feature fusion module is designed to fuse the semantic features and edge features extracted
from MRI of different modalities. This module adopts graph convolution to organize different areas into
different vertices and gathers similar semantic features and edge features under the same vertex. It realizes
the reasoning and propagation of information between semantic features and edge features, which
improves the feature fusion. Through the above designs, the proposed segmentation framework can
effectively extract and fuse deep semantic features and edge features in multimodal MRI, and experimental
results on the BraTS benchmarks of 2018, 2019 and 2020 demonstrate its superior performance in brain tumor
segmentation.
In summary, the contributions of this paper are four-fold.
1. The primary contribution of this paper is that we propose a deep learning-based brain tumor segmentation
method that simultaneously utilizes deep semantic features and specific edge features. To the
best of our knowledge, this differs from existing works, which mostly focus on the extraction
and fusion of semantic features from multimodal MRI. To achieve this target, three modules including
a semantic segmentation module, an edge detection module, and a feature fusion module are designed,
leading to the following three technical contributions.
2. We present a Swin Transformer-based semantic segmentation module to extract semantic features
for brain tumor segmentation. In particular, to address the problem caused by the lack of locality
inductive bias, we introduce a shifted patch tokenization strategy into the Swin Transformer, leading
to easier training on small datasets.
3. We present a CNN-based edge detection module to extract edge features from the FLAIR and T1ce
modalities. In this module, an edge spatial attention block (ESAB) using the Sobel operator is designed
to enhance edge features, which are extracted in a progressive manner.
4. We present a feature fusion module to fuse the extracted deep semantic features and edge features.
Specifically, a multi-feature inference block (MFIB) based on graph convolution is designed to achieve
effective feature fusion.
2. Related Work
Various methods for brain tumor segmentation have been proposed in recent years. These methods
can be broadly classified into two categories: generative model-based methods and discriminative
model-based methods [19]. The generative model-based methods focus on the appearance characteristics
of tumorous and healthy tissues, and thus require domain-specific prior information, which is usually
obtained through probabilistic image atlases. Menze et al. [20] augmented a probabilistic atlas of healthy
tissue priors with a latent atlas of the lesion and derived an estimation algorithm to extract tumor boundaries
and the latent atlas from the image data. Heinrich et al. [21] employed discrete optimization and
self-similarity for multimodal medical image segmentation under a discrete medical image registration framework.
The discriminative model-based methods regard tumor segmentation as a classification problem that determines
the class of each voxel. Owing to the rapid development of machine learning techniques, the discriminative
model-based methods have gradually become the main trend in this field. Early methods in this category
mainly rely on hand-crafted features such as local histograms [22] and texture features [23], and then employ
discriminative models such as decision trees [24] and conditional random fields [25] for classification.
In the past few years, deep learning has rapidly become the mainstream in the study of brain tumor
segmentation. Some early approaches adopt a patch-based classification strategy and design CNNs to
predict the class of the center voxel of a 2D or 3D patch [19, 26]. However, it is difficult for such
patch-based methods to fully consider the correlation among neighboring patches within a relatively large range.
To address this issue, end-to-end semantic segmentation models such as U-Net [6], attention U-Net [27]
and U-Net++ [28] have become more popular in brain tumor segmentation. Myronenko [3] proposed a
segmentation network that adds a variational auto-encoder branch to reconstruct the input image for more
effective feature learning. Liu et al. [29] introduced pixel-level image fusion as an auxiliary task to regularize
feature learning and presented a multi-task model for brain tumor segmentation. Isensee et al. [30] proposed
an adaptive framework based on 2D U-Net, 3D U-Net and U-Net Cascade. The framework automatically
adjusts all hyperparameters without human intervention.
Although the CNN-based methods have achieved great success in brain tumor segmentation, it is known
that CNNs are limited in capturing global contextual information, which is a crucial clue
for semantic segmentation. To solve this problem, Transformer-based methods have gained increasing
attention in the field of medical image segmentation, with representative models such as
TransUNet [31], which combines Transformer and U-Net, and MedT [32], which presents gated axial-attention
for segmentation. In the study of brain tumor segmentation, the TransBTS proposed by Wang et al. [10] is
the first work that uses a Transformer for segmentation and achieves good performance. The above methods
are based on the standard ViT [9], in which the input size is fixed, leading to high computational
complexity for dense prediction problems such as semantic segmentation. The Swin Transformer [11], which
adopts a hierarchical structure to obtain feature maps like CNNs, can effectively alleviate this problem. This
improvement greatly enhances the potential of Transformer models for semantic segmentation, and motivates
us to adopt the Swin Transformer for brain tumor segmentation in this work. Nevertheless, similar to the
standard ViT, the Swin Transformer still suffers from a lack of locality inductive bias [12].
As a consequence, it is difficult to apply the Swin Transformer to small datasets without pre-training,
which is inconvenient in medical image analysis tasks including brain
tumor segmentation, since an appropriate pre-trained model is not always available in practice. To tackle
this problem, we introduce a shifted patch tokenization strategy [12] into the Swin Transformer for brain
tumor segmentation, so that the model can be trained from scratch.
To obtain more accurate segmentation results, the use of multimodal MRI data has become an
active topic in brain tumor segmentation. Most existing methods simply adopt a multi-channel input
by stacking multimodal MRI scans and do not fully consider the differences in their importance to
brain tumor segmentation. Pereira et al. [33] designed a convolutional network for automatic brain tumor
segmentation using a four-channel format for multimodal images. Wang et al. [34] achieved the segmentation
of different brain tumor regions by constructing three cascaded networks. To use multimodal
information more effectively, some feature fusion and selection approaches have appeared based on specific
architectures such as attention mechanisms. Dolz et al. [35] extended dense connections to multimodal image
segmentation based on DenseNets. Each modality is fed into the network as an individual
branch, and dense connections between branches are used to fuse features from different modalities.
Liu et al. [36] proposed an attention-based modality selection feature fusion module for multimodal feature
refinement to address the difference among multiple modalities for a given segmentation target. Zhang
et al. [37] used an FCN to extract features from images of different modalities, and designed a
modality-aware module for more efficient information exchange across different modalities. Mo et al. [2] divided the
modalities into a main modality and auxiliary modalities, and applied the attention mechanism for
feature fusion.
Although the above-mentioned methods make good efforts to utilize multimodal MRI information, they
all focus on the extraction and selection of deep semantic features, while ignoring some features with specific
meaning and importance to the segmentation problem. In this paper, in addition to the semantic features, we
further concentrate on extracting the edge information of multiple types of tumor regions from the relevant
modalities, namely FLAIR and T1ce. This information is of great significance for improving segmentation quality, since
edge information helps to obtain more accurate locations and boundaries of tumors. The extracted
edge features are merged with the semantic features, aiming to utilize multimodal MRI more effectively
and improve the segmentation accuracy accordingly.
3. The Proposed Method
3.1. Overview
The framework of the proposed brain tumor segmentation method is shown in Fig. 2. It is mainly
composed of a semantic segmentation module, an edge detection module and a feature fusion module.
The semantic segmentation module adopts an improved Swin Transformer block to extract deep semantic features from
multimodal MRI scans including FLAIR, T1, T1ce and T2. The edge detection module aims to extract edge
features by employing a convolutional network as the backbone and designing edge spatial attention blocks
(ESABs) for feature enhancement. Considering the characteristics of the MRI modalities, only FLAIR
and T1ce are selected as the input of the edge detection module. The feature fusion module consists of several
multi-feature inference blocks (MFIBs), aiming to fuse the semantic features obtained from the semantic
segmentation module and the edge features obtained from the edge detection module at multiple levels. To reconstruct
the segmentation result, a successive expanding decoder that is widely adopted in U-Net-like architectures
is employed. For the edge detection task, the result is directly obtained by bilinear interpolation. For the
feature fusion module, the output includes both an edge detection result and a segmentation result. To train
the network, the semantic segmentation module and edge detection module are first trained individually.
Then, the output of the feature fusion module is used to supervise the training of the entire model.
Figure 2: The framework of the proposed brain tumor segmentation method. It mainly consists of three modules: a semantic
segmentation module, an edge detection module and a feature fusion module.
3.2. Semantic Segmentation Module
In the semantic segmentation module, we apply the Swin Transformer with an improved patch merging
approach to extract semantic features for the segmentation task. The Swin Transformer consists of four
stages, as shown in Fig. 2. For the last three stages, the original Swin Transformer blocks with a patch merging
step are adopted [11]. For the first stage, patch partition and linear embedding are required prior
to the Swin Transformer blocks. Let $X \in \mathbb{R}^{H \times W \times C}$ denote the input, where
$H \times W$ is the size of the input feature map and $C$ is the number of channels. The input is first divided into
$N$ patches of size $P \times P$, and the patches are flattened into vectors to form
$x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$. Next, these vectors are mapped to $D$ dimensions through a
trainable linear projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$, and a learnable position embedding
$E_{pos} \in \mathbb{R}^{N \times D}$ is added to obtain the visual tokens $z$ as
$$ z = x_p E + E_{pos}, \tag{1} $$
where $z$ is input to the Transformer blocks as an embedding sequence. Since the Transformer directly partitions
the features into patches, the local information in each patch is difficult to capture, which leaves the Transformer
with little locality inductive bias.
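The tokenization in Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes that the position embedding has one row per token (shape N × D) so that the sum is well defined; the function name `patch_embed` and all sizes are illustrative choices, not taken from the released code.

```python
import numpy as np

def patch_embed(x, patch_size, proj, pos):
    """Tokenize an H x W x C array into N = (H/P) * (W/P) tokens of dimension D."""
    H, W, C = x.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # Split into non-overlapping P x P patches, then flatten each patch.
    patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    x_p = patches.reshape(-1, P * P * C)      # N x (P^2 * C)
    return x_p @ proj + pos                   # Eq. (1): z = x_p E + E_pos

rng = np.random.default_rng(0)
H, W, C, P, D = 8, 8, 4, 2, 16                # illustrative sizes only
x = rng.standard_normal((H, W, C))
E = rng.standard_normal((P * P * C, D))       # linear projection (assumed random here)
E_pos = rng.standard_normal(((H // P) * (W // P), D))
z = patch_embed(x, P, E, E_pos)               # one D-dimensional token per patch
```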
Figure 3: The architecture of a Swin Transformer block with the shifted patch tokenization strategy. W-MSA and SW-MSA
denote multi-head self-attention with regular window and shifted window partitioning, respectively.
To address this problem, as shown in Fig. 3, this paper introduces a shifted patch tokenization strategy
[12], which embeds more spatial information into each visual token and thereby increases the locality inductive bias
of the Transformer, avoiding extensive pre-training. Specifically, before patch partition, the input is shifted
by a patch size in four directions, and the original input and its shifted versions are concatenated. Finally,
patch partition and linear embedding are performed.
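A minimal sketch of the shifting step is given below. It assumes that the four directions are the four diagonals and that vacated pixels are zero-filled, following the usual description of shifted patch tokenization; the shift amount is left as a free parameter, and the helper names are hypothetical.

```python
import numpy as np

def shift2d(x, dy, dx):
    """Shift an H x W x C array by (dy, dx) pixels, filling vacated pixels with zeros."""
    H, W, _ = x.shape
    out = np.zeros_like(x)
    src_y = slice(max(0, -dy), min(H, H - dy))
    dst_y = slice(max(0, dy), min(H, H + dy))
    src_x = slice(max(0, -dx), min(W, W - dx))
    dst_x = slice(max(0, dx), min(W, W + dx))
    out[dst_y, dst_x] = x[src_y, src_x]
    return out

def shifted_patch_input(x, shift):
    """Concatenate x with four diagonally shifted copies along the channel axis."""
    offsets = [(-shift, -shift), (-shift, shift), (shift, -shift), (shift, shift)]
    return np.concatenate([x] + [shift2d(x, dy, dx) for dy, dx in offsets], axis=-1)

x = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
x_spt = shifted_patch_input(x, shift=1)   # channel count grows from C to 5C
```

The concatenated array is then passed to the ordinary patch partition and linear embedding, so each token also sees pixels from neighbouring patches.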
Each stage in the Swin Transformer consists of $L$ blocks, each composed of multi-head self-attention (MSA)
and a multilayer perceptron (MLP). The structure of each block is shown in Fig. 3. Layer normalization
(LN) is applied first and residual connections are used. The MLP contains two fully connected layers with
GELU activation. The above process can be expressed as
$$ \hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \tag{2} $$
$$ z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}, \tag{3} $$
$$ \hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \tag{4} $$
$$ z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}, \tag{5} $$
where $\hat{z}^{l}$ and $z^{l}$ represent the output features of the W-MSA or SW-MSA module and of the MLP module
at the $l$-th block, respectively, and W-MSA and SW-MSA denote window-based multi-head self-attention using regular and
shifted window partitioning configurations, respectively. At the end of the Transformer layer, the output $z^{l}$
goes through an LN layer to obtain the final output $z$:
$$ z = \text{LN}(z^{l}). \tag{6} $$
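The residual wiring of Eqs. (2)-(5) can be sketched as follows. This is only a structural illustration: the attention and MLP sub-layers are passed in as plain functions (the stand-ins below are hypothetical), and the layer normalization omits the learnable scale and shift.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Normalize each token over its feature dimension (no learnable affine here)."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def swin_block_pair(z, w_msa, sw_msa, mlp_a, mlp_b):
    """Two consecutive blocks: regular-window attention, then shifted-window attention."""
    z_hat = w_msa(layer_norm(z)) + z          # Eq. (2)
    z = mlp_a(layer_norm(z_hat)) + z_hat      # Eq. (3)
    z_hat = sw_msa(layer_norm(z)) + z         # Eq. (4)
    z = mlp_b(layer_norm(z_hat)) + z_hat      # Eq. (5)
    return z

rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 8))             # 16 tokens with 8 features each
zero = lambda t: np.zeros_like(t)             # stand-in sub-layer that outputs zeros
out = swin_block_pair(z0, zero, zero, zero, zero)
```

With all sub-layers returning zero, the residual paths pass the input through unchanged, which is a quick way to check the wiring before plugging in real attention and MLP layers.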
To generate a hierarchical representation, the Swin Transformer reduces the number of tokens and
increases the dimensionality through patch merging. In each patch merging step, the features of adjacent
2×2 patches are concatenated and a linear layer is applied to the concatenated features to increase the dimensionality.
Specifically, the following output features are obtained after four stages:
$$ X^{1}_{seg} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 128}, \quad X^{2}_{seg} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 256}, $$
$$ X^{3}_{seg} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 512}, \quad X^{4}_{seg} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 1024}. $$
These features are subsequently input into the feature fusion module and fused with the output features
of the edge detection module to achieve better segmentation performance.
Figure 4: The architecture of the edge spatial attention block (ESAB) designed for edge feature enhancement.
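The patch merging step can be sketched in NumPy as below. The sketch assumes the common Swin layout in which the four neighbours of each 2×2 group are concatenated along the channel axis (C to 4C) and a linear map then sets the output width; the weight matrix is random and purely illustrative.

```python
import numpy as np

def patch_merging(x, weight):
    """x: H x W x C, weight: (4C) x C_out. Returns (H/2) x (W/2) x C_out."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    # Gather the four members of every 2 x 2 neighbourhood and stack their channels.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )                                          # (H/2) x (W/2) x 4C
    return merged @ weight                     # linear layer on the concatenated features

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 8, 128))          # stand-in for a stage-1 feature map
W_merge = rng.standard_normal((4 * 128, 256)) * 0.01
x2 = patch_merging(x1, W_merge)                # half the resolution, twice the channels
```

Applying the same step repeatedly reproduces the 128, 256, 512, 1024 channel progression listed above while halving the spatial size each time.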
3.3. Edge Detection Module
The segmentation performance can be improved by supplementing the edge information of brain tumors.
However, edge features are shallow features. Directly using the features of the last convolution block would
force the deep layers of the network to capture shallow edge features, which degrades the extraction of
the edge features. At the same time, the intermediate layers also provide rich convolutional features about edge
information [38, 39]. Therefore, it is necessary to utilize all the convolution layers to obtain richer features.
To this end, this paper designs an edge detection module that utilizes the features of multiple convolution
layers simultaneously. As shown in the edge detection part of Fig. 2, after the image is input into the network,
features of different dimensions are extracted through four convolution blocks. Each convolution block consists
of two 3×3 convolutional layers, a regularization layer and a 2×2 max-pooling layer. Specifically, when
the image to be detected is input to the edge detection module, the output features obtained after the four
convolution blocks are given as
$$ X^{1}_{edge} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 128}, \quad X^{2}_{edge} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 256}, $$
$$ X^{3}_{edge} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 512}, \quad X^{4}_{edge} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 1024}. $$
To enhance the edge features, the features of each convolution block are refined by an edge
spatial attention block (ESAB), whose architecture is shown in Fig. 4. The ESAB uses the Sobel convolution,
and a 1×1 convolution reduces the feature volume to a single-channel map to obtain
the output feature. Furthermore, the output features of each layer are added to the output features of
the next layer, so that richer edge features are obtained in a progressive manner. Finally, the output features
are interpolated to the original input size to reconstruct the edge detection result.
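Since Fig. 4 is not reproduced here, the exact wiring of the ESAB is not recoverable from the text. The following NumPy sketch only illustrates the two ingredients named above, a 1×1 convolution to a single channel and a Sobel filter, combined in one plausible way: the normalized Sobel gradient magnitude is used as a spatial attention map that re-weights the features. The combination rule, the padding mode and all names are assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Apply a 3 x 3 filter (cross-correlation) with edge-replicating padding."""
    H, W = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((H, W))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    return out

def sobel_magnitude(img):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    return np.sqrt(filter3x3(img, SOBEL_X) ** 2 + filter3x3(img, SOBEL_Y) ** 2)

def edge_spatial_attention(feat, w_1x1, eps=1e-8):
    """feat: H x W x C, w_1x1: C weights of a 1 x 1 convolution to one channel."""
    single = feat @ w_1x1                       # H x W single-channel map
    mag = sobel_magnitude(single)
    attn = mag / (mag.max() + eps)              # attention values in [0, 1]
    return feat * (1.0 + attn[..., None]), attn # assumed residual re-weighting

feat = np.zeros((8, 8, 1))
feat[:, 4:, 0] = 1.0                            # vertical step edge at column 4
enhanced, attn = edge_spatial_attention(feat, np.array([1.0]))
```

In this toy input the attention map is non-zero only in the two columns next to the step, so only features on the boundary are amplified.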
3.4. Feature Fusion Module
The feature fusion module, which consists of four multi-feature inference blocks (MFIBs), is designed to
fuse the semantic features and edge features extracted by the above two modules. In recent years,
graph-based approaches have become increasingly widespread and have been verified to be an effective way of
relational reasoning, which makes them a suitable tool for multi-feature fusion [40]. In this work, we
design the feature fusion module based on graph convolution by referring to a recent work on graph-based
global reasoning [41]. The architecture of our MFIB is shown in Fig. 5(a).
The features obtained by the semantic segmentation module and the edge detection module are denoted
as $X_{seg} \in \mathbb{R}^{H \times W \times C}$ and $X_{edge} \in \mathbb{R}^{H \times W \times C}$, respectively.
To achieve better integration of semantic and edge features, we map the input features from the spatial domain $X$
to the graph domain $G \in \mathbb{R}^{N \times F}$, where $N$
represents the number of nodes in the graph and $F$ represents the feature dimension of each node [41]. In this
way, pixels with similar features can be aggregated into a node as an anchor to generate a semantic-aware
graph feature. The feature fusion process is detailed below. Let $X_{seg}$ and $X_{edge}$ denote the input semantic
and edge features of a given MFIB, respectively. They are mapped to the graph domain to obtain $G_{seg}$ and
$G_{edge}$ through two convolutional layers as
$$ G_{seg} = v(X_{seg}; W_v) \otimes w(X_{seg}; W_w), \tag{7} $$
$$ G_{edge} = v(X_{edge}; W_v) \otimes w(X_{edge}; W_w), \tag{8} $$
where $v(\cdot)$ represents the convolution operations used for graph projection and $w(\cdot)$ represents the
convolution operations used for feature dimensionality reduction. $W_v$ and $W_w$ denote the learnable kernels of $v(\cdot)$
and $w(\cdot)$, respectively. The symbol $\otimes$ indicates matrix multiplication. More details of the above projection
can be found in [41].
After projection, in order to learn the relationship between the related node features of the semantic
graph and the edge graph, graph convolution [42] is adopted to learn edge weights corresponding to the
features of each node for reasoning on the fully connected graph. The input of the graph convolution unit
$G$ is obtained by adding $G_{seg}$ and $G_{edge}$ as
$$ G = G_{seg} + G_{edge}. \tag{9} $$
The architecture of the graph convolution unit is shown in Fig. 5(b), which is implemented by two 1D
convolutions in channel-wise and node-wise directions. As a result, the output can be expressed as
$$ \hat{G} = ((I - A_g) G) W_g, \tag{10} $$
where $I \in \mathbb{R}^{N \times N}$ represents the identity matrix, $A_g \in \mathbb{R}^{N \times N}$ represents
the adjacency matrix, and $W_g$ represents the update parameter. $A_g$ and $W_g$ are both randomly initialized
and then optimized by gradient descent during training.
Figure 5: The architecture of the multi-feature inference block (MFIB) for feature fusion. (a) The whole architecture of the
i-th MFIB. (b) The specific structure of graph convolution.
ˆ
Gseg =Gseg +ˆ
G, (11)
ˆ
Gedge =Gedge +ˆ
G. (12)
Then, we remap ˆ
Gseg and ˆ
Gedge back to the original spatial domain through the projection operation v(·)
obtained in the mapping step to obtain fused features ˆ
Xseg and ˆ
Xedge as240
ˆ
Xseg =Xseg +v(Xseg;Wv)T⊗ˆ
Gseg,(13)
ˆ
Xedge =Xedge +v(Xedge;Wv)T⊗ˆ
Gedge.(14)
In the proposed method, the inputs of the first MFIB are exactly the semantic and edge features obtained
at the first stage of the semantic segmentation module and the edge detection module, respectively. For the
last three MFIBs, the fused features obtained by the previous block are added to the corresponding original
features to generate the input, as shown in Fig. 5(a).
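Eqs. (7)-(14) can be traced end to end in a short NumPy sketch. It makes several simplifying assumptions that are not stated in the paper: the spatial grid is flattened to H·W rows, the convolutions v(·) and w(·) are written as plain matrix products, the same kernels are shared by both branches, and F is set equal to C so that the sum in Eqs. (13)-(14) is dimensionally valid without an extra expansion layer.

```python
import numpy as np

def mfib(x_seg, x_edge, W_v, W_w, A_g, W_g):
    """x_*: (H*W) x C, W_v: C x N, W_w: C x F, A_g: N x N, W_g: F x F (F == C here)."""
    B_seg, B_edge = (x_seg @ W_v).T, (x_edge @ W_v).T   # v(.): N x (H*W) projections
    G_seg = B_seg @ (x_seg @ W_w)                        # Eq. (7)
    G_edge = B_edge @ (x_edge @ W_w)                     # Eq. (8)
    G = G_seg + G_edge                                   # Eq. (9)
    N = A_g.shape[0]
    G_hat = ((np.eye(N) - A_g) @ G) @ W_g                # Eq. (10)
    G_seg_hat = G_seg + G_hat                            # Eq. (11)
    G_edge_hat = G_edge + G_hat                          # Eq. (12)
    x_seg_hat = x_seg + B_seg.T @ G_seg_hat              # Eq. (13)
    x_edge_hat = x_edge + B_edge.T @ G_edge_hat          # Eq. (14)
    return x_seg_hat, x_edge_hat

rng = np.random.default_rng(0)
HW, C, N = 16, 8, 4                                      # illustrative sizes only
x_seg = rng.standard_normal((HW, C))
x_edge = rng.standard_normal((HW, C))
W_v = rng.standard_normal((C, N)) * 0.1
W_w = rng.standard_normal((C, C)) * 0.1
A_g = rng.standard_normal((N, N)) * 0.1
W_g = rng.standard_normal((C, C)) * 0.1
fused_seg, fused_edge = mfib(x_seg, x_edge, W_v, W_w, A_g, W_g)
```

Because the shared graph term from Eq. (10) is added to both branches, each output carries information from the other modality while keeping its own residual path.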
3.5. Loss Function
The proposed segmentation network is trained with three loss functions: $L_{seg}$, $L_{edge}$ and $L_{fusion}$.
$L_{seg}$ is the loss of the semantic segmentation module for learning semantic features. The BCEDiceLoss, which is a
combination of the binary cross-entropy (BCE) loss and the Dice loss [7], is used to define $L_{seg}$ as
$$ L_{seg} = \sum 0.5 \cdot \left( -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) \right) + 1 - \frac{2 |y \cap \hat{y}|}{|y| + |\hat{y}|}, \tag{15} $$
where $y$ represents the GT and $\hat{y}$ represents the prediction result.
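A minimal NumPy sketch of a BCE-plus-Dice loss in the spirit of Eq. (15) is shown below. It assumes a soft Dice term (the intersection is computed as the sum of y·ŷ), averages the BCE term over pixels instead of summing it, and clips predictions for numerical stability; these are illustrative choices rather than the authors' exact implementation.

```python
import numpy as np

def bce_dice_loss(y, y_hat, eps=1e-7):
    """0.5 * mean BCE + (1 - soft Dice) for binary masks and probabilities."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    dice = 2.0 * np.sum(y * p) / (np.sum(y) + np.sum(p) + eps)
    return 0.5 * bce.mean() + (1.0 - dice)

gt = np.array([1.0, 0.0, 1.0, 1.0])
loss_perfect = bce_dice_loss(gt, gt)                        # close to zero
loss_good = bce_dice_loss(gt, np.array([0.9, 0.1, 0.9, 0.9]))
loss_bad = bce_dice_loss(gt, 1.0 - gt)                      # every pixel is wrong
```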
$L_{edge}$ is the loss of the edge detection module for learning edge features. For the edge detection
problem, class imbalance is an important issue because most samples are negative. To address this
problem, $L_{edge}$ is defined by combining the edge loss presented in [38] and the Dice loss as
$$ L_{edge} = 0.5 \cdot L_{CRF} + L_{Dice}, \tag{16} $$
where the definition of $L_{CRF}$ is given as follows:
$$ L_i^j = \begin{cases} \alpha \cdot \log\left(1 - \hat{y}_i^j\right) & \text{if } \hat{y}_i = 0, \\ 0 & \text{if } 0 < \hat{y}_i < \eta, \\ \beta \cdot \log \hat{y}_i^j & \text{otherwise,} \end{cases} \tag{17} $$
where $\hat{y}_i^j$ represents the predicted value of the $i$-th pixel of the $j$-th edge map, and $\eta$ is a pre-defined threshold.
This means that a pixel marked as positive by fewer than $\eta$ of the annotators is discarded when the loss
is calculated, i.e., it is counted neither as a positive nor as a negative sample. $\beta$ is the proportion of
negative samples, and $\alpha = \lambda \cdot (1 - \beta)$, where $\lambda$ is a hyperparameter for balancing positive and negative samples.
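The class-balanced edge term can be sketched in plain Python as follows. The sketch follows the usual reading of the loss in [38]: the three cases are selected by the annotation value of each pixel (0 for background, below η for ambiguous, otherwise edge), β is computed as the fraction of negative pixels, and the sum of weighted log terms is negated so that lower values are better. The default λ = 1.1 is an illustrative assumption.

```python
import math

def balanced_edge_loss(pred, label, eta=0.3, lam=1.1, eps=1e-7):
    """pred: predicted edge probabilities; label: annotation values in [0, 1]."""
    n_pos = sum(1 for y in label if y >= eta)
    n_neg = sum(1 for y in label if y == 0)
    if n_pos + n_neg == 0:
        return 0.0
    beta = n_neg / (n_pos + n_neg)             # proportion of negative samples
    alpha = lam * (1.0 - beta)                 # weight of the negative term
    total = 0.0
    for p, y in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)        # keep the logarithms finite
        if y == 0:
            total += alpha * math.log(1.0 - p) # background pixel
        elif y < eta:
            continue                           # ambiguous pixel: ignored
        else:
            total += beta * math.log(p)        # edge pixel
    return -total

pred = [0.9, 0.1, 0.5, 0.2]
label = [1.0, 0.0, 0.1, 0.0]                   # third pixel is ambiguous
edge_loss_value = balanced_edge_loss(pred, label)
```

Because edge pixels are rare, β is close to 1 and α is small, so the few positive pixels are not overwhelmed by the many negative ones.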
$L_{fusion}$ is the loss of the feature fusion module. It is defined as
$$ L_{fusion} = L'_{seg} + \gamma \cdot L'_{edge}, \tag{18} $$
where $\gamma$ is the weight parameter. $L'_{seg}$ and $L'_{edge}$ have the same definitions as $L_{seg}$ and
$L_{edge}$, but are calculated using the predictions of the feature fusion module.
In this paper, we first train the semantic segmentation module and edge detection module using $L_{seg}$
and $L_{edge}$, respectively. Then, the entire model is further trained using $L_{fusion}$.
4. Experiments
4.1. Dataset and Implementation Details
In the experiments, the training and testing datasets are all from the BraTS2018, BraTS2019 and BraTS2020
benchmarks [43–45]. As an important public dataset for multimodal brain tumor segmentation, BraTS is
used in the annual MICCAI brain tumor segmentation challenge and widely adopted in the study of this
topic. Cases are added, removed or replaced in each year's competition to enrich the dataset. BraTS2018,
2019, and 2020 have 285, 335, and 369 annotated brain tumor samples for model training, respectively.
Each case has MRI scans of four different modalities (FLAIR, T1, T1ce and T2) and is labeled by domain
experts. The labels contain four classes: background, NCR/NET, ED and ET. The evaluation is based on
three different brain tumor regions: Whole Tumor (WT = NCR/NET + ED + ET), Tumor Core (TC =
NCR/NET + ET) and Enhancing Tumor (ET). In this paper, we adopt the two most commonly used metrics in
medical image segmentation for performance assessment, namely the Dice score and the 95% Hausdorff
distance (HD).
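The composition of the three evaluation regions and the Dice score can be sketched as follows. The integer label values (1 for NCR/NET, 2 for ED, 4 for ET) follow the usual BraTS 2018-2020 convention and are an assumption here, since the text does not state them; the function names are illustrative.

```python
import numpy as np

NCR_NET, ED, ET = 1, 2, 4   # assumed BraTS label values; 0 is background

def tumor_regions(label):
    """Build the binary masks of the three evaluated regions from a label map."""
    return {
        "WT": np.isin(label, [NCR_NET, ED, ET]),   # whole tumor
        "TC": np.isin(label, [NCR_NET, ET]),       # tumor core
        "ET": label == ET,                         # enhancing tumor
    }

def dice_score(pred, gt, eps=1e-8):
    """Dice = 2 |A ∩ B| / (|A| + |B|) for two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt_label = np.array([[0, 1], [2, 4]])
pred_label = np.array([[0, 1], [4, 4]])            # ED pixel mislabeled as ET
gt_regions, pred_regions = tumor_regions(gt_label), tumor_regions(pred_label)
scores = {k: dice_score(pred_regions[k], gt_regions[k]) for k in gt_regions}
```

In this toy case the whole tumor is still matched exactly, while the tumor core and enhancing tumor scores drop because one edema pixel was predicted as enhancing tumor.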
In the preprocessing stage, each scan has a size of 240×240×155. The scans of all
modalities are sliced, and the size of each slice is 240×240. For the semantic segmentation module, all
four modalities are used as the input. For the edge detection module, the input consists of the FLAIR and
T1ce modalities. In addition, z-score normalization is performed on the raw data to resolve
inconsistencies in image contrast across modalities.
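The preprocessing described above can be sketched as follows. The sketch assumes that normalization statistics are computed over each whole volume and that slices are taken along the last axis; the dictionary keys, the function names and the small stand-in volume size are illustrative (real scans are 240×240×155).

```python
import numpy as np

def zscore(volume, eps=1e-8):
    """Normalize a volume to zero mean and unit standard deviation."""
    return (volume - volume.mean()) / (volume.std() + eps)

def make_slice_inputs(scans, k):
    """scans: dict modality -> H x W x S volume. Returns the inputs for slice k."""
    norm = {name: zscore(vol) for name, vol in scans.items()}
    seg_in = np.stack([norm[m][:, :, k] for m in ("FLAIR", "T1", "T1ce", "T2")])
    edge_in = np.stack([norm[m][:, :, k] for m in ("FLAIR", "T1ce")])
    return seg_in, edge_in        # 4-channel and 2-channel 2D inputs

rng = np.random.default_rng(0)
scans = {m: rng.random((24, 24, 5)) * 1000.0 for m in ("FLAIR", "T1", "T1ce", "T2")}
seg_in, edge_in = make_slice_inputs(scans, k=2)
```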
All the programs were implemented under the PyTorch framework. The training process is conducted
on four Tesla P100 GPUs. The optimizer used in the experiments is Adam [46]. The momentum is set to
0.9. The initial learning rate, weight decay and batch size are set to 1e-3, 1e-5, and 16, respectively.
4.2. Parameter Analysis
In Section 3.5, several adjustable parameters such as η, β and γ are involved in the loss functions.
The parameters η and β are set to the default values 0.3 and 1.1 according to [?]. In this subsection, we
mainly analyze the effect of the parameter γ in Eq. (18) on the segmentation performance obtained by the
proposed method. The function of the parameter γ is to balance the semantic segmentation loss and the
edge detection loss so that both of them contribute sufficiently. In this experiment, we set γ to
1, 0.5, 0.1 and 0.05 to study its impact. The BraTS2018 benchmark is used in this
experiment. The corresponding evaluation results are shown in Table 1. It can be seen from the results
that the performance on HD tends to be better when γ increases. This is because more emphasis is placed
on the edge information when γ is larger, leading to higher accuracy at the tumor boundaries, to which
the HD metric is sensitive. For the Dice metric, the proposed method tends to obtain higher performance
when a relatively smaller γ is used, which indicates that the semantic features may be more important in
terms of this metric. Based on the above observations, we set γ to 0.1 by default in our method to achieve
a good trade-off between the performances on the two metrics.
Table 1: The segmentation performance of the proposed method using different values of the parameter γ in Eq. (18).
            WT               TC               ET               Average
            Dice    HD       Dice    HD       Dice    HD       Dice    HD
γ = 1       90.15   3.364    87.91   4.615    81.37   3.327    86.48   3.769
γ = 0.5     90.85   3.720    88.08   4.569    81.62   3.615    86.85   3.968
γ = 0.1     90.89   3.923    87.96   5.217    81.94   3.440    86.93   4.193
γ = 0.05    90.63   4.419    88.14   5.545    81.41   4.289    86.72   4.289
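The trade-off between the two metrics can be checked directly from the Average columns of Table 1. The short sketch below only re-reads the reported numbers; the variable names are hypothetical.

```python
# (average Dice in %, average HD) for each gamma, copied from Table 1.
table1 = {
    1.0: (86.48, 3.769),
    0.5: (86.85, 3.968),
    0.1: (86.93, 4.193),
    0.05: (86.72, 4.289),
}

best_dice_gamma = max(table1, key=lambda g: table1[g][0])  # higher Dice is better
best_hd_gamma = min(table1, key=lambda g: table1[g][1])    # lower HD is better

# HD should get strictly worse (larger) as gamma decreases.
gammas = sorted(table1, reverse=True)
hd_values = [table1[g][1] for g in gammas]
hd_improves_with_gamma = all(a < b for a, b in zip(hd_values, hd_values[1:]))

print(best_dice_gamma, best_hd_gamma, hd_improves_with_gamma)
```

The best average Dice is obtained at γ = 0.1 and the best average HD at γ = 1, with the average HD improving monotonically as γ grows.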
4.3. Comparison with Other Methods
To verify the superiority of the proposed method for brain tumor segmentation, several state-of-the-art
segmentation methods that have been tested on the BraTS2018-2020 benchmarks are used for comparison,
which include 2D or 3D CNN-based segmentation methods [3, 13, 14, 27, 28, 47–50], Transformer-based
segmentation methods [10, 31] and methods that focus on multimodal feature fusion [17, 18, 51]. A
brief description of these methods is given in Table 2. Since the source code of many existing brain tumor
segmentation methods has not been released, and to avoid the bias introduced by model re-training, we directly
take the evaluation results of the corresponding methods from the related publications, which is a
common practice in the study of brain tumor segmentation. The evaluation results of different
methods on the BraTS2018, BraTS2019 and BraTS2020 benchmarks are listed in Table 3, Table 4 and Table
5, respectively. The best-performing values are indicated in bold. The corresponding results are visualized
for better comparison in Fig. 6 and Fig. 7, which illustrate the performance of different brain tumor
segmentation methods on the Dice and HD metrics, respectively. The best-performing method in each case
is marked by a star on the corresponding bar.
According to the results reported in the above tables and figures, the proposed method achieves more
competitive performance than the other methods. Specifically, regarding the average Dice score,
the proposed method achieves 86.71%, 88.22% and 87.95% on the BraTS2018-2020 benchmarks, which
outperforms the other reference methods by 0.58% to 7.81%, 0.42% to 8.95%, and 2.89% to 4.93%, respectively.
Compared with TransBTS, which jointly uses Transformer and U-Net for semantic segmentation, the
proposed method achieves better results in all cases with clear advantages, and the improvement for the tumor
core is the most significant. Additionally, compared with the latest RFNet method that considers multimodal
feature fusion, the proposed method achieves obvious improvement for both the tumor core and enhancing tumor
regions.
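As a worked example of how such margins are obtained, the sketch below recomputes the BraTS2019 and BraTS2020 ranges from the Average Dice columns of Tables 4 and 5. Only values reported in those tables are used; the function name `margin_range` is hypothetical.

```python
# Average Dice (%) of the proposed method and the reference methods,
# copied from the Average Dice columns of Table 4 and Table 5.
avg_dice = {
    "BraTS2019": {
        "proposed": 88.22,
        "others": [80.66, 85.68, 85.87, 87.80, 86.16, 79.27, 83.62],
    },
    "BraTS2020": {
        "proposed": 87.95,
        "others": [85.06, 83.02, 83.52, 84.77],
    },
}


def margin_range(proposed, others):
    """Smallest and largest improvement of the proposed method, in Dice points."""
    margins = [proposed - other for other in others]
    return round(min(margins), 2), round(max(margins), 2)


ranges = {
    name: margin_range(scores["proposed"], scores["others"])
    for name, scores in avg_dice.items()
}
for name, (low, high) in ranges.items():
    print(f"{name}: +{low:.2f} to +{high:.2f}")
```

This yields 0.42 to 8.95 for BraTS2019 and 2.89 to 4.93 for BraTS2020, in absolute Dice percentage points.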
Fig. 8 shows the visual effect comparison of the brain tumor segmentation results obtained by different
Figure 6: Performance comparison of different brain tumor segmentation methods on the Dice metric. The best-performing
method in each case is marked by a star.
Figure 7: Performance comparison of different brain tumor segmentation methods on the HD metric. The best-performing
method in each case is marked by a star.
[Figure 8 image; panel labels: Input Image, U-Net, U-Net++, CENET, TransUNet, Proposed, GT]
Figure 8: Visual effect comparison of brain tumor segmentation results obtained by different methods. The green, yellow and
red indicate ED, ET and NCR/NET regions, respectively.
[Figure 9 image; panel labels: Input Image, U-Net, U-Net++, TransUNet, Proposed, GT; each result panel is annotated with its WT HD score]
Figure 9: Performance comparison of different segmentation methods in terms of tumor boundary accuracy.
Table 2: A brief description of the methods used for performance comparison in our experiments.
Method                Brief description
Myronenko [3]         Proposes a U-Net-based segmentation framework by adding a VAE branch to regularize feature learning.
NoNewNet [13]         Designs an improved U-Net architecture for segmentation.
Attention Unet [27]   Proposes an attention mechanism based on U-Net for segmentation.
U-Net++ [28]          Adds a series of dense skip connections to the U-Net for segmentation.
N3D [14]              Proposes a 3D U-Net for brain tumor segmentation.
Z. Jiang [49]         Proposes an end-to-end cascading U-Net architecture for segmentation.
CENET [47]            Designs a context extractor to generate more advanced semantic feature maps.
HNF-Net [50]          Proposes a 3D high-resolution and non-local feature network for segmentation.
T. Zhou [18]          Designs four independent encoding paths to extract features from four modalities and then fuse them.
D. Zhang [17]         Proposes a task-structured brain tumor segmentation network by considering multimodal fusion.
TransBTS [10]         Proposes an encoder-decoder structure consisting of Transformer and U-Net for segmentation.
RFNet [51]            Proposes a region-aware fusion network that exploits different combinations of multimodal data.
TransUnet [31]        Proposes a universal segmentation framework by combining Transformer and U-Net.
Point-UNet [48]       Designs a U-Net to perform a fine-class segmentation of the input point cloud.
Table 3: Objective evaluation results of different brain tumor segmentation methods on the BraTS2018 benchmark.
Method             WT               TC               ET               Average
                   Dice    HD       Dice    HD       Dice    HD       Dice    HD
Myronenko [3]      90.40   4.483    85.90   8.278    81.40   3.805    85.90   5.500
NoNewNet [13]      90.80   4.790    84.32   8.160    79.59   3.120    84.90   5.357
U-Net++ [28]       88.96   5.327    84.65   8.535    79.49   4.285    84.36   6.049
CENET [47]         89.53   5.271    84.31   8.493    79.95   4.379    84.60   6.193
D. Zhang [17]      89.60   5.733    82.40   9.270    78.20   3.567    83.40   6.190
TransUnet [31]     90.25   4.390    87.19   5.539    80.41   3.731    85.95   4.553
Point-UNet [48]    90.55   -        87.09   -        80.76   -        86.13   6.010
Proposed           90.89   3.923    87.96   5.217    81.94   3.440    86.93   4.193
methods. Compared against the ground truth (GT), the proposed method achieves more accurate segmentation
results than the other methods, especially at the tumor edges, which demonstrates the effectiveness of the
edge features extracted for segmentation.
Fig. 9 illustrates an example comparing the performance of different segmentation methods in terms of
tumor boundary accuracy. As mentioned above, the HD metric is more sensitive to the boundary shape, so
the corresponding HD scores of the whole tumor are provided as well. Among all the methods, the proposed
one obtains the best results in terms of both the HD scores and the visual effect. These results further show
Table 4: Objective evaluation results of different brain tumor segmentation methods on the BraTS2019 benchmark.
Method                WT               TC               ET               Average
                      Dice    HD       Dice    HD       Dice    HD       Dice    HD
Attention Unet [27]   88.81   7.756    77.20   8.258    75.96   5.202    80.66   7.072
U-Net++ [28]          89.67   6.345    87.13   5.521    80.25   3.313    85.68   5.060
Z. Jiang [49]         90.94   4.263    86.47   5.439    80.21   3.146    85.87   4.283
N3D [14]              91.60   6.547    88.80   6.219    83.00   3.543    87.80   5.436
HNF-Net [50]          91.11   4.136    86.40   5.250    80.96   3.490    86.16   4.292
T. Zhou [18]          89.70   6.700    77.50   9.300    70.60   7.400    79.27   7.800
TransBTS [10]         90.00   5.644    81.94   6.049    78.93   3.736    83.62   5.143
Proposed              91.58   3.866    89.24   5.118    83.84   3.080    88.22   4.021
Table 5: Objective evaluation results of different brain tumor segmentation methods on the BraTS2020 benchmark.
Method             WT               TC               ET                Average
                   Dice    HD       Dice    HD       Dice    HD        Dice    HD
U-Net++ [28]       89.77   6.299    85.57   5.483    79.83   4.328     85.06   5.370
Point-UNet [48]    89.67   -        82.97   -        76.43   -         83.02   8.260
TransBTS [10]      90.09   4.964    81.73   9.769    78.73   17.947    83.52   10.893
RFNet [51]         91.11   -        85.21   -        78.00   -         84.77   -
Proposed           91.03   4.719    88.22   5.985    84.61   3.051     87.95   4.585
that edge features can benefit the brain tumor segmentation task.
4.4. Ablation Study
To further verify the effectiveness of the main components designed in our method, including the shifted
patch tokenization strategy, the edge detection module and the MFIB in the feature fusion module,
an ablation study is conducted in this subsection.
Table 6: Objective evaluation results for the ablation study on the BraTS2018 benchmark.
Model               WT               TC               ET               Average
                    Dice    HD       Dice    HD       Dice    HD       Dice    HD
SwinTrans           88.97   6.276    85.72   6.563    80.31   4.364    85.00   5.734
SwinTrans+SPD       89.00   5.720    85.95   6.453    80.62   4.338    85.19   5.504
SwinTrans+SPD+ED    89.93   4.259    87.24   5.398    81.19   3.728    85.95   4.462
Proposed            90.89   3.923    87.96   5.217    81.94   3.440    86.93   4.193
In this experiment, we use the standard Swin Transformer [11] as the baseline model, and then add
different components one by one to validate their effect. Specifically, we mainly compare the performance
of the following four models:
[Figure 10 image; panel labels: Input Image, SwinTrans, SwinTrans+SPD, SwinTrans+SPD+ED, Completed Model, Ground Truth]
Figure 10: Visual effect comparison of segmentation results obtained by different models in the ablation study.
- SwinTrans: Uses only the Swin Transformer for brain tumor segmentation, starting from a well
pre-trained model. This is the baseline model.
- SwinTrans+SPD: Introduces the shifted patch tokenization strategy into the Swin Transformer for
segmentation, without using pre-training.
- SwinTrans+SPD+ED: Further adds the edge detection module to the above model. This
model has a framework similar to the proposed one, but it simply concatenates the semantic and
edge features for fusion instead of using the proposed MFIB.
- Completed Model: The complete model (i.e., SwinTrans+SPD+ED+MFIB) proposed in this paper.
Therefore, the comparison between SwinTrans and SwinTrans+SPD is used to demonstrate the
effectiveness of the shifted patch tokenization strategy adopted in the Swin Transformer. The comparison between
SwinTrans+SPD and SwinTrans+SPD+ED validates the effect of the edge detection module (note
that the ED module cannot be used on its own without the semantic segmentation module).
The comparison between SwinTrans+SPD+ED and the Completed Model is used to show the effectiveness of the
designed MFIB for feature fusion.
Table 6 lists the objective performance of different models. We can see that each of the above
components leads to some improvement in the segmentation results. Among them, the effects of adding the
edge detection module and using the MFIB for feature fusion are more obvious.
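The contribution of each component can be read off as the step-by-step change in the Average columns of Table 6. The sketch below only reuses the reported values; the function name `stepwise_gains` is hypothetical.

```python
# (model, average Dice in %, average HD), copied from Table 6 in ablation order.
table6 = [
    ("SwinTrans", 85.00, 5.734),
    ("SwinTrans+SPD", 85.19, 5.504),
    ("SwinTrans+SPD+ED", 85.95, 4.462),
    ("Proposed", 86.93, 4.193),
]


def stepwise_gains(rows):
    """Change in average Dice and HD when each component is added in turn."""
    gains = []
    for (_, dice_prev, hd_prev), (name, dice_cur, hd_cur) in zip(rows, rows[1:]):
        gains.append((name, round(dice_cur - dice_prev, 2), round(hd_cur - hd_prev, 3)))
    return gains


gains = stepwise_gains(table6)
for name, d_dice, d_hd in gains:
    print(f"{name}: Dice {d_dice:+.2f}, HD {d_hd:+.3f}")
```

The average Dice rises by 0.19, 0.76 and 0.98 points for SPD, ED and MFIB, respectively, while the average HD changes by -0.230, -1.042 and -0.269.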
The visual effect comparison of segmentation results obtained by different models in the ablation study
is shown in Fig. 10. Some interesting observations include: 1) After adding the shifted patch tokenization,
some unnecessary disturbance noise is eliminated. 2) After adding the edge detection module, the segmentation
accuracy at the tumor edges is obviously higher when compared against the GT. 3) After adding the MFIB for
feature fusion, the tumor edges are visually more natural than with simple concatenation.
5. Conclusion
This paper proposes a novel deep learning-based brain tumor segmentation method by jointly utilizing
deep semantics and edge information in multimodal MRI. To achieve this target, three functional modules are
designed. Specifically, we present a semantic segmentation module based on an improved Swin Transformer
by introducing the shifted patch tokenization strategy for better training. In addition, a CNN-based edge
detection module is designed to extract edge features from the input MRI scans. Finally, we present a
feature fusion module by designing a multi-feature inference block based on graph convolution to fuse the
deep semantic features and specific edge features. Experimental results demonstrate the effectiveness of the
key components designed in our method. Moreover, the proposed method achieves better performance
when compared with some state-of-the-art methods on the BraTS benchmarks. Future work may focus
on exploring the feasibility of some other specific features for brain tumor segmentation and extending the
proposed approach to other semantic segmentation problems.
References
[1] F. Piccialli, V. Di Somma, F. Giampaolo, S. Cuomo, G. Fortino, A survey on deep learning in medicine: Why, how and
when?, Information Fusion 66 (2021) 111–137.
[2] S. Mo, M. Cai, L. Lin, R. Tong, Q. Chen, F. Wang, H. Hu, Y. Iwamoto, X.-H. Han, Y.-W. Chen, Multimodal priors
guided segmentation of liver lesions in mri using mutual information based graph co-attention networks, in: International
Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 429–438.
[3] A. Myronenko, 3d mri brain tumor segmentation using autoencoder regularization, in: International MICCAI Brainlesion
Workshop, Springer, 2018, pp. 311–320.
[4] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez,
D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (xai): Concepts, taxonomies, oppor-
tunities and challenges toward responsible ai, Information Fusion 58 (2020) 82–115.
[5] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE
conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[6] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International
Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
[7] F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmen-
tation, in: 2016 fourth international conference on 3D vision (3DV), IEEE, 2016, pp. 565–571.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you
need, Advances in neural information processing systems 30.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint
arXiv:2010.11929.
[10] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, J. Li, Transbts: Multimodal brain tumor segmentation using transformer, in:
International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 109–119.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using
shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[12] S. H. Lee, S. Lee, B. C. Song, Vision transformer for small-size datasets, arXiv preprint arXiv:2112.13492.
[13] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, K. H. Maier-Hein, No new-net, in: International MICCAI Brainlesion
Workshop, Springer, 2018, pp. 234–244.
[14] F. Wang, R. Jiang, L. Zheng, C. Meng, B. Biswal, 3d u-net based brain tumor segmentation and survival days prediction,
in: International MICCAI Brainlesion Workshop, Springer, 2019, pp. 131–141.
[15] T. Cheng, X. Wang, L. Huang, W. Liu, Boundary-preserving mask r-cnn, in: European conference on computer vision,
Springer, 2020, pp. 660–676.
[16] D. Acuna, A. Kar, S. Fidler, Devil is in the edges: Learning semantic boundaries from noisy annotations, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11075–11083.
[17] D. Zhang, G. Huang, Q. Zhang, J. Han, J. Han, Y. Wang, Y. Yu, Exploring task structure for brain tumor segmentation
from multi-modality mr images, IEEE Transactions on Image Processing 29 (2020) 9032–9043.
[18] T. Zhou, S. Ruan, Y. Guo, S. Canu, A multi-modality fusion network based on attention mechanism for brain tumor
segmentation, in: 2020 IEEE 17th international symposium on biomedical imaging (ISBI), IEEE, 2020, pp. 377–380.
[19] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, H. Larochelle, Brain tumor
segmentation with deep neural networks, Medical image analysis 35 (2017) 18–31.
[20] B. H. Menze, K. v. Leemput, D. Lashkari, M.-A. Weber, N. Ayache, P. Golland, A generative model for brain tumor
segmentation in multi-modal images, in: International Conference on Medical Image Computing and Computer-Assisted
Intervention, Springer, 2010, pp. 151–159.
[21] M. P. Heinrich, O. Maier, H. Handels, Multi-modal multi-atlas segmentation using discrete optimisation and self-
similarities., VISCERAL Challenge@ ISBI 1390 (2015) 27.
[22] M. Goetz, C. Weber, J. Bloecher, B. Stieltjes, H.-P. Meinzer, K. Maier-Hein, Extremely randomized trees based brain
tumor segmentation, Proceeding of BRATS challenge-MICCAI (2014) 006–011.
[23] N. K. Subbanna, D. Precup, D. L. Collins, T. Arbel, Hierarchical probabilistic gabor and mrf segmentation of brain
tumours in mri volumes, in: International conference on medical image computing and computer-assisted intervention,
Springer, 2013, pp. 751–758.
[24] D. Zikic, B. Glocker, E. Konukoglu, A. Criminisi, C. Demiralp, J. Shotton, O. M. Thomas, T. Das, R. Jena, S. J. Price,
Decision forests for tissue-specific segmentation of high-grade gliomas in multi-channel mr, in: International Conference
on Medical Image Computing and Computer-Assisted Intervention, Springer, 2012, pp. 369–376.
[25] W. Wu, A. Y. Chen, L. Zhao, J. J. Corso, Brain tumor detection and segmentation in a crf (conditional random fields)
framework with pixel-pairwise affinity and superpixel-level features, International journal of computer assisted radiology
and surgery 9 (2) (2014) 241–253.
[26] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, B. Glocker, Efficient
multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, Medical image analysis 36 (2017) 61–78.
[27] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz,
et al., Attention u-net: Learning where to look for the pancreas, arXiv preprint arXiv:1804.03999.
[28] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, Unet++: Redesigning skip connections to exploit multiscale features
in image segmentation, IEEE transactions on medical imaging 39 (6) (2019) 1856–1867.
[29] Y. Liu, F. Mu, Y. Shi, X. Chen, Sf-net: A multi-task model for brain tumor segmentation in multimodal mri via image
fusion, IEEE Signal Processing Letters 29 (2022) 1799–1803.
[30] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-
based biomedical image segmentation, Nature methods 18 (2) (2021) 203–211.
[31] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, Y. Zhou, Transunet: Transformers make strong
encoders for medical image segmentation, arXiv preprint arXiv:2102.04306.
[32] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, V. M. Patel, Medical transformer: Gated axial-attention for medical image
segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2021, pp. 36–46.
[33] S. Pereira, A. Pinto, V. Alves, C. A. Silva, Brain tumor segmentation using convolutional neural networks in mri images,
IEEE transactions on medical imaging 35 (5) (2016) 1240–1251.
[34] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolu-
tional neural networks, in: International MICCAI brainlesion workshop, Springer, 2017, pp. 178–190.
[35] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, I. B. Ayed, Hyperdense-net: a hyper-densely connected cnn
for multi-modal image segmentation, IEEE transactions on medical imaging 38 (5) (2018) 1116–1126.
[36] Y. Liu, F. Mu, Y. Shi, J. Cheng, C. Li, X. Chen, Brain tumor segmentation in multimodal mri via pixel-level and
feature-level image fusion, Frontiers in Neuroscience 16 (2022) 1000587.
[37] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, Z. He, Modality-aware mutual learning for multi-modal medical
image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention,
Springer, 2021, pp. 589–599.
[38] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, X. Bai, Richer convolutional features for edge detection, in: Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp. 3000–3009.
[39] X. Chen, C. Dong, J. Ji, J. Cao, X. Li, Image manipulation detection by multi-view multi-scale supervision, in: Proceedings
of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14185–14193.
[40] L. Jiao, J. Chen, F. Liu, S. Yang, C. You, X. Liu, L. Li, B. Hou, Graph representation learning meets computer vision: A
survey, IEEE Transactions on Artificial Intelligence.
[41] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, Y. Kalantidis, Graph-based global reasoning networks, in: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 433–442.
[42] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.
[43] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest,
et al., The multimodal brain tumor image segmentation benchmark (brats), IEEE transactions on medical imaging 34 (10)
(2014) 1993–2024.
[44] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, C. Davatzikos, Advancing
the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features, Scientific data 4 (1)
(2017) 1–13.
[45] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki,
et al., Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall
survival prediction in the brats challenge, arXiv preprint arXiv:1811.02629.
[46] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[47] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, Ce-net: Context encoder network for 2d
medical image segmentation, IEEE transactions on medical imaging 38 (10) (2019) 2281–2292.
[48] N.-V. Ho, T. Nguyen, G.-H. Diep, N. Le, B.-S. Hua, Point-unet: A context-aware point-based neural network for volumetric
segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2021, pp. 644–655.
[49] Z. Jiang, C. Ding, M. Liu, D. Tao, Two-stage cascaded u-net: 1st place solution to brats challenge 2019 segmentation
task, in: International MICCAI brainlesion workshop, Springer, 2019, pp. 231–241.
[50] H. Jia, Y. Xia, W. Cai, H. Huang, Learning high-resolution and efficient non-local features for brain glioma segmentation
in mr images, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2020, pp. 480–490.
[51] Y. Ding, X. Yu, Y. Yang, Rfnet: Region-aware fusion network for incomplete multi-modal brain tumor segmentation, in:
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3975–3984.