Brain Tumor Segmentation Based on the Fusion of Deep Semantics and
Edge Information in Multimodal MRI
Zhiqin Zhu (a), Xianyu He (a), Guanqiu Qi (b), Yuanyuan Li (a), Baisen Cong (c), Yu Liu (d)
(a) College of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
(b) Computer Information Systems Department, State University of New York at Buffalo State, Buffalo, NY 14222, USA
(c) Diagnostics Digital, DH (Shanghai) Diagnostics Co., Ltd., a Danaher company, Shanghai 200335, China
(d) Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China
Corresponding author. Email addresses: zhuzq@cqupt.edu.cn (Zhiqin Zhu), s210301012@stu.cqupt.edu.cn (Xianyu He), qig@buffalostate.edu (Guanqiu Qi), liyy@cqupt.edu.cn (Yuanyuan Li), bcong@dhdiagnostics.com (Baisen Cong), yuliu@hfut.edu.cn (Yu Liu)
Preprint submitted to Information Fusion, October 2, 2022
Abstract
Brain tumor segmentation in multimodal MRI has great significance in clinical diagnosis and treatment.
The utilization of multimodal information plays a crucial role in brain tumor segmentation. However, most
existing methods focus on the extraction and selection of deep semantic features, while ignoring some features
with specific meaning and importance to the segmentation problem. In this paper, we propose a brain tumor
segmentation method based on the fusion of deep semantics and edge information in multimodal MRI, aiming
to achieve a more sufficient utilization of multimodal information for accurate segmentation. The proposed
method mainly consists of a semantic segmentation module, an edge detection module and a feature fusion
module. In the semantic segmentation module, the Swin Transformer is adopted to extract semantic features
and a shifted patch tokenization strategy is introduced for better training. The edge detection module is
designed based on convolutional neural networks (CNNs) and an edge spatial attention block (ESAB) is
presented for feature enhancement. The feature fusion module aims to fuse the extracted semantic and edge
features, and we design a multi-feature inference block (MFIB) based on graph convolution to perform feature
reasoning and information dissemination for effective feature fusion. The proposed method is validated on
the popular BraTS benchmarks. The experimental results verify that the proposed method outperforms
a number of state-of-the-art brain tumor segmentation methods. The source code of the proposed
method is available at https://github.com/HXY-99/brats.
Keywords: Brain tumor segmentation, Transformer, convolutional neural networks, edge feature, feature
fusion
1. Introduction
Medical image segmentation is an important topic in the community of medical image processing. Among
them, brain tumor segmentation aims to localize multiple types of tumor regions from images, which is of
great significance to clinical practice [1]. Owing to its good capacity in providing high-resolution anatomical
structures for soft tissues, magnetic resonance imaging (MRI) is mostly utilized in the diagnosis and treat-
ment of brain tumor diseases. To obtain comprehensive information for accurate segmentation, multimodal
MRI scans with different imaging parameters are usually required in brain tumor segmentation. Commonly-
used modalities include fluid attenuation inversion recovery (FLAIR), T1-weighted (T1), contrast enhanced
T1-weighted (T1ce) and T2-weighted (T2). Images of different modalities capture different pathological
information and they can complement each other effectively [2], which plays a crucial role in segmenting
multiple types of brain tumor regions such as edema (ED), necrosis and non-enhancing tumor (NCR/NET),
and enhancing tumor (ET). An example of multimodal MRI for brain tumor segmentation is shown in Fig.
1. For simplicity, only a slice is selected from the entire scan. Fig. 1(a) shows the ground truth (GT)
segmentation label provided by domain experts. The green, yellow and red indicate ED, ET and NCR/NET
regions, respectively. From Fig. 1(b)-(e), it can be seen that the characteristics of different modalities vary
significantly. For example, the FLAIR modality can well capture the ED regions with distinct edges or
boundaries between the tumor and normal tissues, while the T1ce modality is more effective in detecting
the tumor core (i.e., the union of ET and NCR/NET) with high contrast [3].
Figure 1: An example of multimodal MRI for brain tumor segmentation. (a) The ground truth (GT) segmentation label
provided by domain experts (the green, yellow and red represent edema (ED), enhancing tumor (ET), and necrosis and non-
enhancing tumor (NCR/NET), respectively). (b) The FLAIR modality. (c) The T1 modality. (d) The T1ce modality. (e) The
T2 modality.
A variety of brain tumor segmentation approaches have been presented in the literature. In recent years,
deep learning-based methods have emerged as the mainstream in this field [4]. The most popular way is
to adopt semantic segmentation-oriented convolutional neural networks (CNNs) such as fully convolutional
networks (FCNs) [5], U-Net [6] and V-Net [7] to segment brain tumors. The CNN-based models can well
capture local features in 2D or 3D spaces. However, CNNs are limited by the receptive field of convolutions,
leading to difficulty in characterizing the global dependencies of features, which is essentially an important
clue in semantic segmentation. To address this issue, Transformer-based models [8, 9] have been introduced
into the study of brain tumor segmentation [10]. By establishing the connection between feature base
units (i.e., tokens) and adopting a self-attention mechanism, the Transformer-based models demonstrate
better capacity in capturing global contextual information. However, the input size of the standard vision
Transformer (ViT) [9] is fixed, which causes the problem of excessive computational cost for semantic
segmentation that requires pixel-wise dense prediction. By constructing a hierarchical structure to obtain
feature maps like CNNs, the Swin Transformer [11] well solves this problem and achieves efficient dense
prediction, which exhibits clear advantages for semantic segmentation problems. Nevertheless, the Swin
Transformer still suffers from the problem of low locality inductive bias, which means that it usually requires
a very large amount of training data to obtain satisfactory visual representation [12]. Since the dataset size
of most medical image analysis problems is typically very small, a pre-trained model is generally needed
when utilizing ViT and its variants including Swin Transformer. However, an appropriate pre-trained model
is not always available in practice and performing pre-training by oneself is not an easy task.
In most existing brain tumor segmentation methods, the multimodal MRI scans are simply stacked
as the model input for semantic segmentation, which may cause the insufficient utilization of multimodal
information [13, 14]. In fact, the role of some modalities tends to be more significant in the segmentation task,
as they contain more distinctive information. For instance, the FLAIR and T1ce modalities can effectively
capture the edges of multiple types of tumor regions such as ED and ET. The edge information is actually very
important with regard to brain tumor segmentation, as it not only helps to achieve better localization for the
tumors, but also benefits the boundary quality (e.g., sharpness and accuracy) of segmented regions [15, 16].
It is believed that the edge information could be a good complement to the deep semantic information. It
is worth noting that although some methods [2, 17, 18] consider the complementary features of multimodal
inputs and explore multimodal fusion accordingly, they mainly focus on the extraction and fusion of deep
semantic features without considering the importance of the specific edge features for segmentation.
In this paper, we propose a brain tumor segmentation method based on the fusion of deep semantics
and edge information in multimodal MRI, aiming to achieve a more sufficient utilization of multimodal
information for accurate segmentation. Specifically, the proposed segmentation framework consists of three
main modules: semantic segmentation, edge detection and feature fusion. The semantic segmentation mod-
ule adopts the Swin Transformer as the backbone due to its advantages mentioned above. Moreover, we
introduce a shifted patch tokenization strategy [12] into the Swin Transformer to increase its locality in-
duction bias, so as to achieve easier training for small-size datasets. The edge detection module is designed
based on CNNs to extract edge features from the FLAIR and T1ce modalities by considering their char-
acteristics. The feature fusion module is designed to fuse semantic features and edge features extracted
from MRI of different modalities. This module adopts graph convolution to structure different areas into
different vertices and collects similar semantic features and edge features under the same vertex. It realizes
the reasoning and dissemination of information between semantic features and edge features, leading to the
improvement of feature fusion effect. Through the above designs, the proposed segmentation framework can
effectively extract and fuse deep semantic features and edge features in multimodal MRI, and experimental
results on BraTS benchmarks in 2018, 2019 and 2020 demonstrate its superior performance in brain tumor
segmentation.
In summary, the contributions of this paper are four-fold.
1. The primary contribution of this paper is that we propose a deep learning-based brain tumor segmen-
tation method that simultaneously utilizes deep semantic features and specific edge features. To the
best of our knowledge, this manner is different from existing works that mostly focus on the extraction
and fusion of semantic features from multimodal MRI. To achieve this target, three modules including
a semantic segmentation module, an edge detection module, and a feature fusion module are designed,
leading to the following three technical contributions.
2. We present a Swin Transformer-based semantic segmentation module to extract semantic features
for brain tumor segmentation. In particular, to address the problem caused by the lack of locality
inductive bias, we introduce a shifted patch tokenization strategy into the Swin Transformer, leading
to easier training for small-size datasets.
3. We present a CNN-based edge detection module to extract edge features from the FLAIR and T1ce
modalities. In this module, an edge spatial attention block (ESAB) using the Sobel operator is designed
to enhance edge features, which are extracted in a progressive manner.
4. We present a feature fusion module to fuse the extracted deep semantic features and edge features.
Specifically, a multi-feature inference block (MFIB) based on graph convolution is designed to achieve
effective feature fusion.
2. Related Work
Various methods for brain tumor segmentation have been proposed in recent years. These methods
can be broadly classified into two categories: the generative model-based methods and the discriminative
model-based methods [19]. The generative model-based methods focus on the appearance characteristics
of tumorous and healthy tissues, thus requiring related domain-specific prior information, which is usually
obtained through probabilistic image atlases. Menze et al. [20] augmented a probabilistic atlas of healthy
tissue priors with a latent atlas of the lesion and derived an estimation algorithm to extract tumor boundaries
and the latent atlas from the image data. Heinrich et al. [21] employed discrete optimization and self-
similarity for multimodal medical image segmentation under a discrete medical image registration framework.
The discriminative model-based methods regard tumor segmentation as a classification problem to determine
the property of voxels. Owing to the rapid development of machine learning techniques, the discriminative
model-based methods have gradually become the main trend in this field. Early methods in this category
mainly rely on hand-crafted features such as local histograms [22] and texture features [23], and then employ
discriminative models such as decision trees [24] and conditional random fields [25] for classification.
In the past few years, deep learning has rapidly become the mainstream in the study of brain tumor
segmentation. Some early approaches adopt a patch-based classification strategy and design CNNs to
predict the class of the center voxel of a 2D or 3D patch [19, 26]. However, it is difficult for such patch-
based methods to fully consider the correlation among neighboring patches within a relatively large range.
To address this issue, end-to-end semantic segmentation models such as U-Net [6], attention U-Net [27]
and U-Net++ [28] have become more popular in brain tumor segmentation. Myronenko [3] proposed a
segmentation network that adds a variational auto-encoder branch to reconstruct the input image for more
effective feature learning. Liu et al. [29] introduced pixel-level image fusion as an auxiliary task to regularize
feature learning and presented a multi-task model for brain tumor segmentation. Isensee et al. [30] proposed
an adaptive framework based on 2D U-Net, 3D U-Net and U-Net Cascade. The framework automatically
adjusts all hyperparameters without human intervention.
Although the CNN-based methods have achieved great success in brain tumor segmentation, it is known
that CNNs suffer from the limitation of capturing global contextual information, which is a crucial clue
for semantic segmentation. To solve this problem, the Transformer-based methods have gained increasing
attention in the field of medical image segmentation, with some representative models being proposed, such as
the TransUNet [31], which combines Transformer and U-Net, and the MedT [32], which presents gated axial-attention
for segmentation. In the study of brain tumor segmentation, the TransBTS proposed by Wang et al. [10] is
the first work that uses Transformer for segmentation and achieves good performance. The above methods
are based on the standard ViT [9], in which the input size should be fixed, leading to high computational
complexity for dense prediction problems such as semantic segmentation. The Swin Transformer [11], which
adopts hierarchical structure to obtain feature maps like CNNs, can effectively alleviate this problem. This
improvement greatly enhances the potential of Transformer models for semantic segmentation and motivates
us to adopt the Swin Transformer for brain tumor segmentation in this work. Nevertheless, similar to the
standard ViT, the Swin Transformer still suffers from a defect, i.e., the lack of locality inductive bias [12].
As a consequence, it is difficult to utilize the Swin Transformer for small-size datasets without pre-
training, leading to some inconvenience when applying it to medical image analysis tasks including brain
tumor segmentation, since an appropriate pre-trained model is not always available in practice. To tackle
this problem, we introduce a shifted patch tokenization strategy [12] into the Swin Transformer for brain
tumor segmentation, so that the model can be trained from scratch.
In order to obtain more accurate segmentation results, the use of multimodal MRI data has become an
interesting topic in brain tumor segmentation. Most existing methods simply adopt a multi-channel input
by stacking multimodal MRI scans and do not fully consider the differences in their importance to
brain tumor segmentation. Pereira et al. [33] designed a convolutional network for automatic brain tumor
segmentation using a four-channel format for multimodal images. Wang et al. [34] achieved the segmentation
of different brain tumor regions by constructing three cascaded networks. In order to use multimodal
information more effectively, some feature fusion and selection approaches have appeared based on specific
architectures such as attention mechanism. Dolz et al. [35] extended dense connections to multimodal image
segmentation based on DenseNets, where each modality is input into the network individually as a branch, and
dense connections between branches are used to fuse features from different modalities.
Liu et al. [36] proposed an attention-based modality selection feature fusion module for multimodal feature
refinement to address the difference among multiple modalities for a given segmentation target. Zhang
et al. [37] used FCN to extract features from images of different modalities, and designed a modality-
aware module for more efficient information exchange across different modalities. Mo et al. [2] divided the
different modalities into main modality and auxiliary modality, and applied the attention mechanism for
feature fusion.
Although the above-mentioned methods make good efforts to utilize multimodal MRI information, they
all focus on the extraction and selection of deep semantic features, while ignoring some features with specific
meaning and importance to the segmentation problem. In this paper, in addition to the semantic features, we
further concentrate on extracting the edge information of multiple types of tumor regions from some relevant
modalities including FLAIR and T1ce, which are of great significance for improving segmentation quality, since
the edge information is helpful for obtaining more accurate locations and boundaries of tumors. The extracted
edge features are merged with the semantic features, aiming to utilize multimodal MRI more effectively
and improve the segmentation accuracy accordingly.
3. The Proposed Method
3.1. Overview
The framework of the proposed brain tumor segmentation method is shown in Fig. 2. It is mainly
composed of a semantic segmentation module, an edge detection module and a feature fusion module.
The semantic module adopts an improved Swin Transformer block to extract deep semantic features from
multimodal MRI scans including FLAIR, T1, T1ce and T2. The edge detection module aims to extract edge
features by employing a convolutional network as the backbone and designing edge spatial attention blocks
(ESABs) for feature enhancement. Considering the modal characteristics of MRI modalities, only FLAIR
and T1ce are selected as the input of the edge detection module. The feature fusion module consists of several
multi-feature inference blocks (MFIBs), aiming to fuse the semantic features obtained from the semantic
segmentation module and the edge features obtained from the edge detection module at multiple levels. To reconstruct
the segmentation result, a successive expanding decoder that is widely adopted in U-Net-like architectures
is employed. For the edge detection task, the result is directly obtained by bilinear interpolation. For the
Figure 2: The framework of the proposed brain tumor segmentation method. It mainly consists of three modules: a semantic
segmentation module, an edge detection module and a feature fusion module.
feature fusion module, the output includes both an edge detection result and a segmentation result. To train
the network, the semantic segmentation module and edge detection module are first trained individually.
Then, the output of the feature fusion module is used to supervise the training of the entire model.
3.2. Semantic Segmentation Module
In the semantic segmentation module, we apply the Swin Transformer with an improved patch merging
approach to extract semantic features for the segmentation tasks. The Swin Transformer consists of four
stages, as shown in Fig. 2. For the last three stages, the original Swin Transformer blocks with a patch merging
step are adopted [11]. For the first stage, the steps of patch partition and linear embedding are required prior
to the Swin Transformer blocks. Let X ∈ R^{H×W×C} denote the input, where H×W represents the size
of the input feature map and C represents the number of channels. The input image is first divided into
N patches of size P×P, and then each patch is reshaped into a 1D vector, giving x_p ∈ R^{N×(P²·C)}. Next, these
patches are flattened and mapped to D dimensions through a trainable linear projection E ∈ R^{(P²·C)×D} to
obtain the visual token z involving a learnable position variable E_pos ∈ R^{N×D} as
z = x_p E + E_pos,   (1)
where z is input to the Transformer block as an embedding sequence. Since the Transformer directly divides
the features, the local information in the patch is difficult to capture, thereby making the Transformer lack
the ability of locality inductive bias.
7
Figure 3: The architecture of a Swin Transformer block with the shifted patch tokenization strategy. Both W-MSA and SW-
MSA are multi-head attention, which represent the regular window and the shifted window, respectively.
To address this problem, as shown in Fig. 3, this paper introduces a shifted patch tokenization strategy
[12], which can embed more spatial information into the visual token, increasing the locality induction ability
of the Transformer to avoid extensive pre-training. Specifically, the input is shifted before patch partition
by a patch size from four directions, and then the original input and its shifted versions are spliced. Finally,
patch partition and linear embedding are performed.
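To make the tokenization step concrete, the following is a minimal PyTorch sketch of the shifted patch tokenization described above. It is not the released implementation: the module name, the use of torch.roll (a cyclic shift rather than zero-padded shifting), and the patch/embedding sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShiftedPatchTokenization(nn.Module):
    """Sketch: splice the input with four diagonally shifted copies,
    then perform patch partition and linear embedding in one stride-P conv."""
    def __init__(self, in_chans=4, patch_size=4, embed_dim=128):
        super().__init__()
        self.patch_size = patch_size
        # original input + 4 shifted copies -> 5x the channels
        self.proj = nn.Conv2d(in_chans * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                                  # x: (B, C, H, W)
        s = self.patch_size                                # shift amount
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]      # four directions
        shifted = [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + shifted, dim=1)                # splice original and shifts
        x = self.proj(x)                                   # patch partition + embedding
        x = x.flatten(2).transpose(1, 2)                   # (B, N, embed_dim) tokens
        return self.norm(x)

tokens = ShiftedPatchTokenization()(torch.randn(1, 4, 240, 240))  # -> (1, 3600, 128)
```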
Each stage in the Swin Transformer consists of L blocks, each composed of multi-head self-attention (MSA)
and a multilayer perceptron (MLP). The structure of each block is shown in Fig. 3. Layer normalization
(LN) is first applied and residual connections are used. The MLP contains two fully connected layers with
GELU activation. The above process can be expressed as
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1},   (2)
z^l = MLP(LN(ẑ^l)) + ẑ^l,   (3)
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l,   (4)
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1},   (5)
where ẑ^l and z^l represent the output features of the W-MSA (or SW-MSA) module and the MLP module at the
l-th block; W-MSA and SW-MSA denote window-based multi-head self-attention using regular and
shifted window partitioning configurations, respectively. At the end of the Transformer layer, the output z^l
goes through an LN layer to obtain the final output z:
z = LN(z^l).   (6)
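As a reading aid, the block structure of Eqs. (2)-(5) can be sketched in PyTorch as below. Standard multi-head attention stands in for W-MSA/SW-MSA so the snippet stays self-contained; the window partitioning and cyclic shift of the real Swin Transformer block are deliberately omitted.

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Sketch of Eqs. (2)-(5): a regular-window block followed by a
    shifted-window block, each with pre-LN residual attention and MLP."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                  nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                              # z: (B, N, dim)
        h = self.norm[0](z)
        z = self.wmsa(h, h, h)[0] + z                  # Eq. (2)
        z = self.mlp1(self.norm[1](z)) + z             # Eq. (3)
        h = self.norm[2](z)
        z = self.swmsa(h, h, h)[0] + z                 # Eq. (4), window shift omitted
        z = self.mlp2(self.norm[3](z)) + z             # Eq. (5)
        return z

z = SwinBlockPair()(torch.randn(1, 3600, 128))
```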
To generate a hierarchical representation, the Swin Transformer reduces the number of tokens and
increases the dimensionality through patch merging. In each patch merging step, the features of adjacent
2×2 patches are concatenated and the concatenated features are linearized to increase the dimensionality.
Specifically, the following output features are obtained after the four stages:
X^1_seg ∈ R^{(H/2)×(W/2)×128},  X^2_seg ∈ R^{(H/4)×(W/4)×256},
X^3_seg ∈ R^{(H/8)×(W/8)×512},  X^4_seg ∈ R^{(H/16)×(W/16)×1024}.
These features are subsequently input into the feature fusion module and fused with the output features
of the edge detection module to achieve better segmentation performance.
3.3. Edge Detection Module
The segmentation performance can be improved by supplementing the edge information of brain tumors.
However, the edge features are shallow features. Directly using features of the last convolution block will
force the deep network to capture the shallow edge features, thus affecting the extraction performance of
the edge features. At the same time, the middle layers can also bring rich convolutional features about edge
information [38, 39]. Therefore, it is necessary to utilize all the convolution layers to obtain richer features.
To this end, this paper designs an edge detection module to utilize the features of multiple convolution
layers simultaneously. As shown in the edge detection module of Fig. 2, after the image is input into the network,
features of different dimensions are extracted through four convolution blocks. Each convolution block consists
of two 3×3 convolutional layers, a regularization layer and a 2×2 max-pooling layer. Specifically, when
the image to be detected is input to the edge detection module, the output features obtained after four
convolution blocks are given as
X^1_edge ∈ R^{(H/2)×(W/2)×128},  X^2_edge ∈ R^{(H/4)×(W/4)×256},
X^3_edge ∈ R^{(H/8)×(W/8)×512},  X^4_edge ∈ R^{(H/16)×(W/16)×1024}.
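A plausible sketch of one such convolution block is given below; the choice of batch normalization as the "regularization layer" and of ReLU activations is an assumption, and the channel widths follow the feature sizes listed above (with FLAIR and T1ce as the two input channels).

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One edge-branch block: two 3x3 convolutions, a normalization layer
    (batch norm assumed here) and 2x2 max-pooling, halving the resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(kernel_size=2),
    )

# FLAIR + T1ce slices in, four blocks matching the listed channel widths.
edge_backbone = nn.ModuleList([conv_block(2, 128), conv_block(128, 256),
                               conv_block(256, 512), conv_block(512, 1024)])
```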
In order to enhance the edge features, the features of each convolution block are refined by an edge
spatial attention block (ESAB). The architecture of the ESAB is shown in Fig. 4.
Figure 4: The architecture of the edge spatial attention block (ESAB) designed for edge feature enhancement.
The Sobel convolution is used in the ESAB. A 1×1 convolution reduces the feature volume to a single-channel map to obtain
the output feature. Furthermore, the output features of a certain layer are added to the output features of
the next layer, leading to a progressive manner of obtaining richer edge features. Finally, the output features
are interpolated to the original input size to reconstruct the edge detection result.
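The following is a minimal sketch of the ESAB idea: a 1×1 convolution produces a single-channel map, fixed Sobel kernels extract its gradients, and the resulting edge response is used as a spatial attention map. The exact arrangement of Fig. 4 may differ; this is an illustration, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESAB(nn.Module):
    """Edge spatial attention block (sketch): 1x1 reduction to one channel,
    fixed Sobel filtering, and spatial re-weighting of the input features."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(channels, 1, kernel_size=1)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3))
        self.register_buffer("ky", sobel_x.t().contiguous().view(1, 1, 3, 3))

    def forward(self, x):
        m = self.reduce(x)                              # (B, 1, H, W) edge map
        gx = F.conv2d(m, self.kx, padding=1)            # Sobel gradients
        gy = F.conv2d(m, self.ky, padding=1)
        attn = torch.sigmoid(torch.sqrt(gx ** 2 + gy ** 2 + 1e-6))
        return x * attn, m                              # enhanced features, edge map

feat, edge_map = ESAB(128)(torch.randn(1, 128, 120, 120))
```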
3.4. Feature Fusion Module
The feature fusion module that consists of four multi-feature inference blocks (MFIBs) is designed to
fuse semantic features and edge features extracted by the above two modules. In recent years, graph-
based applications have become increasingly widespread and have been verified to be an effective way of
relational reasoning, which makes it a suitable tool to implement multi-feature fusion [40]. In this work, we
design the feature fusion module based on graph convolution by referring to a recent work on graph-based
global reasoning [41]. The architecture of our MFIB is shown in Fig. 5(a).
The features obtained by the semantic segmentation module and the edge detection module are denoted
as X_seg ∈ R^{H×W×C} and X_edge ∈ R^{H×W×C}, respectively. To achieve better integration of semantic and edge
features, we map the input features from the spatial domain X to the graph domain G ∈ R^{N×F}, where N
represents the number of nodes in the graph and F represents the features contained in a node [41]. In this
way, pixels with similar features can be aggregated into a node as an anchor to generate a semantic-aware
graph feature. The feature fusion process is detailed below. Let X_seg and X_edge denote the input semantic
and edge features of a given MFIB, respectively. They are mapped to the graph domain to obtain G_seg and
G_edge through two convolutional layers as
G_seg = v(X_seg; W_v) ⊗ w(X_seg; W_w),   (7)
G_edge = v(X_edge; W_v) ⊗ w(X_edge; W_w),   (8)
where v(·) represents the convolution operation used for graph projection and w(·) represents the convolution
operation used for feature dimensionality reduction. W_v and W_w denote the learnable kernels of v(·)
and w(·), respectively. The symbol ⊗ indicates matrix multiplication. More details of the above projection
can be found in [41].
After projection, in order to learn the relationship between the related node features of the semantic
graph and edge graph, the graph convolution [42] is adopted to learn edge weights corresponding to the
features of each node for reasoning on the fully connected graph. The input of the graph convolution unit,
G, is obtained by adding G_seg and G_edge as
G = G_seg + G_edge.   (9)
Figure 5: The architecture of the multi-feature inference block (MFIB) for feature fusion. (a) The whole architecture of the
i-th MFIB. (b) The specific structure of graph convolution.
The architecture of the graph convolution unit is shown in Fig. 5(b), which is implemented by two 1D
convolutions in channel-wise and node-wise directions. As a result, the output can be expressed as
Ĝ = ((I − A_g) G) W_g,   (10)
where I ∈ R^{N×N} represents the identity matrix, A_g ∈ R^{N×N} represents the adjacency matrix, and W_g
represents the update parameters. A_g and W_g are both randomly initialized during training and optimized by
gradient descent.
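A compact sketch of Eq. (10) is given below, with the learnable adjacency A_g and the state update W_g realised as 1D convolutions over the node and channel dimensions, respectively; the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class GraphConvUnit(nn.Module):
    """Sketch of Eq. (10): (I - A_g) G followed by the update W_g,
    realised as node-wise and channel-wise 1D convolutions."""
    def __init__(self, num_nodes, num_feats):
        super().__init__()
        # node-wise 1D conv plays the role of the learnable adjacency A_g
        self.adj = nn.Conv1d(num_nodes, num_nodes, kernel_size=1)
        # channel-wise 1D conv plays the role of the update weights W_g
        self.update = nn.Conv1d(num_feats, num_feats, kernel_size=1)

    def forward(self, g):                      # g: (B, N, F)
        h = g - self.adj(g)                    # (I - A_g) G via node mixing
        h = self.update(h.transpose(1, 2))     # W_g applied over channels
        return h.transpose(1, 2)               # back to (B, N, F)

g_hat = GraphConvUnit(num_nodes=64, num_feats=128)(torch.randn(2, 64, 128))
```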
The fused graphs for semantic and edge features are further calculated as
Ĝ_seg = G_seg + Ĝ,   (11)
Ĝ_edge = G_edge + Ĝ.   (12)
Then, we remap Ĝ_seg and Ĝ_edge back to the original spatial domain through the projection operation v(·)
obtained in the mapping step to obtain the fused features X̂_seg and X̂_edge as
X̂_seg = X_seg + v(X_seg; W_v)^T ⊗ Ĝ_seg,   (13)
X̂_edge = X_edge + v(X_edge; W_v)^T ⊗ Ĝ_edge.   (14)
In the proposed method, the inputs of the first MFIB are exactly the semantic and edge features obtained
at the first stage of the semantic segmentation module and the edge detection module, respectively. For the
last three MFIBs, the fused features obtained by the previous block are added to the corresponding original
features to generate the input, as shown in Fig. 5(a).
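The MFIB computation of Eqs. (7)-(14) can be summarised by the following sketch, which reuses the GraphConvUnit above. The 1×1 convolutions for v(·) and w(·), the node/feature sizes, and the final 1×1 convolution that maps node features back to the input channel width (needed to make the residual addition dimensionally consistent) are assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class MFIB(nn.Module):
    """Sketch of Eqs. (7)-(14): project semantic/edge features to a graph,
    reason jointly with a graph convolution, and project back residually."""
    def __init__(self, channels, num_nodes=64, node_feats=128):
        super().__init__()
        self.v = nn.Conv2d(channels, num_nodes, kernel_size=1)   # graph projection v(.)
        self.w = nn.Conv2d(channels, node_feats, kernel_size=1)  # dim. reduction w(.)
        self.gcn = GraphConvUnit(num_nodes, node_feats)          # from the sketch above
        self.back = nn.Conv2d(node_feats, channels, kernel_size=1)

    def project(self, x):                        # Eqs. (7)-(8)
        v = self.v(x).flatten(2)                 # (B, N, HW)
        f = self.w(x).flatten(2)                 # (B, F, HW)
        return v @ f.transpose(1, 2), v          # graph (B, N, F), projection

    def forward(self, x_seg, x_edge):
        g_seg, v_seg = self.project(x_seg)
        g_edge, v_edge = self.project(x_edge)
        g_hat = self.gcn(g_seg + g_edge)         # Eqs. (9)-(10)
        out = []
        for x, g, v in ((x_seg, g_seg + g_hat, v_seg),
                        (x_edge, g_edge + g_hat, v_edge)):   # Eqs. (11)-(12)
            b, _, h, w = x.shape
            spatial = v.transpose(1, 2) @ g      # (B, HW, F), Eqs. (13)-(14)
            spatial = spatial.transpose(1, 2).reshape(b, -1, h, w)
            out.append(x + self.back(spatial))   # residual fusion
        return out                               # fused semantic, fused edge features
```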
3.5. Loss Function
The proposed segmentation network is trained by three loss functions: L_seg, L_edge and L_fusion. L_seg is
the loss of the semantic segmentation module for learning semantic features. The BCEDiceLoss, which is a
combination of the binary cross-entropy (BCE) loss and the Dice loss [7], is used to define L_seg as
L_seg = Σ 0.5·(−y log ŷ − (1 − y) log(1 − ŷ)) + 1 − 2|y ∩ ŷ| / (|y| + |ŷ|),   (15)
where y represents the GT and ŷ represents the prediction result.
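A per-class sketch of Eq. (15) is shown below, assuming the prediction is already a probability map (e.g., after a sigmoid).

```python
import torch

def bce_dice_loss(pred, target, eps=1e-6):
    """Sketch of Eq. (15): 0.5 * binary cross-entropy + Dice loss,
    computed on probability maps `pred` and binary targets `target`."""
    pred = pred.clamp(eps, 1 - eps)
    bce = -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()
    inter = (pred * target).sum()
    dice = 1 - 2 * inter / (pred.sum() + target.sum() + eps)
    return 0.5 * bce + dice

loss = bce_dice_loss(torch.rand(2, 3, 240, 240),
                     torch.randint(0, 2, (2, 3, 240, 240)).float())
```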
L_edge is the loss of the edge detection module for learning edge features. For the edge detection
problem, class imbalance is important because most samples are negative. To address this
problem, L_edge is defined by combining the edge loss presented in [38] and the Dice loss as
L_edge = 0.5·L_CRF + L_Dice,   (16)
where the definition of L_CRF is given as follows:
L^j_i = { α·log(1 − ŷ^j_i),   if y_i = 0
          0,                  if 0 < y_i < η
          β·log(ŷ^j_i),       otherwise,   (17)
where ŷ^j_i represents the predicted value of the i-th pixel of the j-th edge map, and η is a pre-defined threshold.
This means that if a pixel is marked as positive by fewer than a fraction η of the annotators, it is discarded when the loss
is calculated and is not treated as a positive sample. β is the percentage of negative samples.
α = λ·(1 − β), where λ is a hyperparameter for balancing positive and negative samples.
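The class-balanced term of Eqs. (16)-(17) can be sketched as follows. Pixels whose consensus label lies in (0, η) are ignored, positives are weighted by β and negatives by α = λ·(1 − β); the negative-log form is used so the expression behaves as a loss, and reading the reported default of 1.1 (Section 4.2) as the balancing hyperparameter λ is our assumption.

```python
import torch

def balanced_edge_loss(pred, target, eta=0.3, lam=1.1, eps=1e-6):
    """Sketch of Eq. (17): class-balanced BCE for edge maps. `target` holds
    annotator consensus in [0, 1]; pixels with 0 < target < eta are ignored."""
    pos = (target >= eta).float()
    neg = (target == 0).float()
    valid = pos + neg                                   # ambiguous pixels excluded
    beta = neg.sum() / valid.sum().clamp(min=1.0)       # fraction of negatives
    alpha = lam * (1.0 - beta)                          # alpha = lambda * (1 - beta)
    pred = pred.clamp(eps, 1 - eps)
    loss = -(beta * pos * pred.log() + alpha * neg * (1 - pred).log())
    return loss.sum() / valid.sum().clamp(min=1.0)
```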
L_fusion is the loss of the feature fusion module. It is defined as
L_fusion = L′_seg + γ·L′_edge,   (18)
where γ is a weight parameter. L′_seg and L′_edge have the same definitions as L_seg and L_edge, but are
computed using the predictions of the feature fusion module.
In this paper, we first train the semantic segmentation module and the edge detection module using L_seg
and L_edge, respectively. Then, the entire model is further trained using L_fusion.
4. Experiments
4.1. Dataset and Implementation Details
In the experiments, the training and testing datasets are all from BraTS2018, BraTS2019 and BraTS2020
benchmarks [43–45]. As an important public dataset for multimodal brain tumor segmentation, BraTS is
used in the annual MICCAI brain tumor segmentation challenge and widely adopted in the study of this
topic. Samples are added, deleted or replaced in each year's competition to enrich the dataset's scale. BraTS2018,
2019, and 2020 have 285, 335, and 369 annotated brain tumor samples for model training, respectively.
Each case has MRI scans of four different modalities (FLAIR, T1, T1ce and T2) and is labeled by domain
experts. The labels contain four classes: background, NCR/NET, ED and ET. The evaluation is based on
three different brain tumor regions: Whole Tumor (WT = NCR/NET + ED + ET), Tumor Core (TC =
NCR/NET + ET) and Enhancing Tumor (ET). In this paper, we adopt the two most commonly used metrics in
medical image segmentation for performance assessment, which are the Dice score and the 95% Hausdorff
distance (HD).
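For reference, the Dice score on a binary region mask (WT, TC or ET) can be computed as below; the 95% Hausdorff distance is usually obtained with external evaluation tooling and is not re-implemented here.

```python
import torch

def dice_score(pred_mask, gt_mask, eps=1e-6):
    """Dice score between two binary region masks (e.g., WT, TC or ET)."""
    pred_mask = pred_mask.float()
    gt_mask = gt_mask.float()
    inter = (pred_mask * gt_mask).sum()
    return (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

wt_pred = torch.randint(0, 2, (240, 240))
wt_gt = torch.randint(0, 2, (240, 240))
print(f"WT Dice: {dice_score(wt_pred, wt_gt):.4f}")
```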
In the preprocessing stage, the size of each scan is 240×240×155. In this paper, the scans of all
modalities are sliced, and the size of each slice is 240×240. For the semantic segmentation module, all
four modalities are used as the input. For the edge detection module, the input consists of the FLAIR and
T1ce modalities. In addition, the popular z-score normalization is performed on the raw data to resolve
inconsistencies in image contrast under different modalities.
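A sketch of this preprocessing is given below, assuming each modality is available as a 240×240×155 array; computing the z-score statistics over non-zero (brain) voxels is a common convention that we assume rather than one stated in the text.

```python
import numpy as np

def zscore_normalize(volume):
    """Per-modality z-score normalization over non-zero (brain) voxels."""
    brain = volume[volume > 0]
    out = volume.astype(np.float32).copy()
    out[volume > 0] = (brain - brain.mean()) / (brain.std() + 1e-8)
    return out

def to_slices(volume):
    """Split a 240x240x155 scan into 155 axial slices of size 240x240."""
    return [volume[:, :, k] for k in range(volume.shape[2])]

flair = zscore_normalize(np.random.rand(240, 240, 155))  # stand-in for a loaded scan
slices = to_slices(flair)
```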
All the programs were implemented under the PyTorch framework. The training process was conducted
on four Tesla P100 GPUs. The optimizer used in the experiments is Adam [46]. The momentum is set to
0.9. The initial learning rate, weight decay and batch size are set to 1e-3, 1e-5, and 16, respectively.
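Expressed as a hedged PyTorch snippet, the reported training configuration looks as follows; the placeholder model and the reading of the reported momentum as Adam's beta1 are assumptions.

```python
import torch

model = torch.nn.Conv2d(4, 4, 3)   # placeholder for the full segmentation network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # initial learning rate
    betas=(0.9, 0.999),  # beta1 = 0.9 read as the reported momentum
    weight_decay=1e-5,
)
# Batch size 16; training distributed over four Tesla P100 GPUs.
```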
4.2. Parameter Analysis
In Section 3.5, several adjustable parameters such as η, β and γ are involved in the loss functions.
The parameters η and β are set to the default values of 0.3 and 1.1 according to [?]. In this subsection, we
mainly analyze the effect of the parameter γ in Eq. (18) on the segmentation performance obtained by the
proposed method. The function of the parameter γ is to balance the semantic segmentation loss and the
edge detection loss to ensure both of them have sufficient contributions. In this experiment, we set γ to a
set of values including 1, 0.5, 0.1 and 0.05 to study its impact. The BraTS 2018 benchmark is used in this
experiment. The corresponding evaluation results are shown in Table 1. It can be seen from the results
that the performance on HD tends to be better when γ increases. This is because the edge information receives
more attention when γ is larger, leading to higher accuracy in terms of the tumor boundaries, while the
metric HD is sensitive to that. For the metric Dice, the proposed method tends to obtain higher performance
when a relatively smaller γ is used, which indicates that the semantic features may be more important in
terms of this metric. Based on the above observations, we set γ to 0.1 by default in our method to achieve
a good trade-off between the performances on the two metrics.
Table 1: The segmentation performance of the proposed method using different values of the parameter γ in Eq. (18).
WT TC ET Average
Dice HD Dice HD Dice HD Dice HD
γ=1 90.15 3.364 87.91 4.615 81.37 3.327 86.48 3.769
γ=0.5 90.85 3.720 88.08 4.569 81.62 3.615 86.85 3.968
γ=0.1 90.89 3.923 87.96 5.217 81.94 3.440 86.93 4.193
γ=0.05 90.63 4.419 88.14 5.545 81.41 4.289 86.72 4.289
4.3. Comparison with Other Methods
To verify the superiority of the proposed method for brain tumor segmentation, several state-of-the-art
segmentation methods that have been tested on the BraTS2018-2020 benchmarks are used for comparison,
which include 2D or 3D CNN-based segmentation methods [3, 13, 14, 27, 28, 47–50], Transformer-based
segmentation methods [10, 31] and the methods that focus on multimodal feature fusion [17, 18, 51]. A
brief description of these methods is given in Table 2. Since the source code of many existing brain tumor
segmentation methods was not released, and to avoid the bias introduced in model re-training, we directly
refer to related publications to obtain the evaluation results of the corresponding methods, which is a
commonly used manner in the study of brain tumor segmentation. The evaluation results of different
methods on BraTS2018, BraTS2019 and BraTS2020 benchmarks are listed in Table 3, Table 4 and Table
5, respectively. The best-performed values are indicated in bold. The corresponding results are visualized
for better comparison in Fig. 6 and Fig. 7, which illustrate the performance of different brain tumor
segmentation methods on two metrics Dice and HD, respectively. The best-performed method in each case
is marked by a star on the corresponding bar.
According to the results reported in the above Tables and Figures, the proposed method achieves more
competitive performance when compared with other methods. Specifically, regarding the average Dice score,
the proposed method achieves 86.71%, 88.22% and 87.95% on the BraTS2018-2020 benchmarks, which out-
performs other reference methods by 0.58% to 7.81%, 0.42% to 8.95%, and 2.89% to 4.93%, respectively.
Compared with TransBTS, which jointly uses Transformer and U-Net for semantic segmentation, the pro-
posed method achieves better results in all cases with clear advantages, and the improvement for the tumor
core is most significant. Additionally, compared with the latest RFNet method that considers multimodal
feature fusion, the proposed method achieves obvious improvement for both tumor core and enhancing tumor
regions.
Fig. 8 shows the visual effect comparison of the brain tumor segmentation results obtained by different
Figure 6: Performance comparison of different brain tumor segmentation methods on the metric Dice. The best-performed
method in each case is marked by a star.
Figure 7: Performance comparison of different brain tumor segmentation methods on the metric HD. The best-performed
method in each case is marked by a star.
Figure 8: Visual effect comparison of brain tumor segmentation results obtained by different methods (the columns include
the input image, U-Net, U-Net++, CENET, TransUNet, the proposed method and the GT). The green, yellow and
red indicate ED, ET and NCR/NET regions, respectively.
Figure 9: Performance comparison of different segmentation methods in terms of tumor boundary accuracy (the columns are
the input image, U-Net, U-Net++, TransUNet, the proposed method and the GT; the whole-tumor HD value is given for each
result).
Table 2: A brief description of the methods used for performance comparison in our experiments.
Method: Brief description
Myronenko [3]: Proposes a U-Net-based segmentation framework by adding a VAE branch to regularize feature learning.
NoNewNet [13]: Designs an improved U-Net architecture for segmentation.
Attention Unet [27]: Proposes an attention mechanism based on U-Net for segmentation.
U-Net++ [28]: Adds a series of dense skip connections to the U-Net for segmentation.
N3D [14]: Proposes a 3D U-Net for brain tumor segmentation.
Z. Jiang [49]: Proposes an end-to-end cascading U-Net architecture for segmentation.
CENET [47]: Designs a context extractor to generate more advanced semantic feature maps.
HNF-Net [50]: Proposes a 3D high-resolution and non-local feature network for segmentation.
T. Zhou [18]: Designs four independent encoding paths to extract features from four modalities and then fuse them.
D. Zhang [17]: Proposes a task-structured brain tumor segmentation network by considering multimodal fusion.
TransBTS [10]: Proposes an encoder-decoder structure consisting of Transformer and U-Net for segmentation.
RFNet [51]: Proposes a region-aware fusion network that exploits different combinations of multimodal data.
TransUnet [31]: Proposes a universal segmentation framework by combining Transformer and U-Net.
Point-UNet [48]: Designs a U-Net to perform a fine-class segmentation of the input point cloud.
Table 3: Objective evaluation results of different brain tumor segmentation methods on the BraTS2018 benchmark.
WT TC ET Average
Dice HD Dice HD Dice HD Dice HD
Myronenko[3] 90.40 4.483 85.90 8.278 81.40 3.805 85.90 5.500
NoNewNet[13] 90.80 4.790 84.32 8.160 79.59 3.120 84.90 5.357
U-Net++[28] 88.96 5.327 84.65 8.535 79.49 4.285 84.36 6.049
CENET[47] 89.53 5.271 84.31 8.493 79.95 4.379 84.60 6.193
D. Zhang[17] 89.60 5.733 82.40 9.270 78.20 3.567 83.40 6.190
TransUnet[31] 90.25 4.390 87.19 5.539 80.41 3.731 85.95 4.553
Point-UNet[48] 90.55 - 87.09 - 80.76 - 86.13 6.010
Proposed 90.89 3.923 87.96 5.217 81.94 3.440 86.93 4.193
methods. By referring to the ground truth (GT), the proposed method achieves more accurate segmentation
results, especially for the tumor edges, than other methods, which demonstrates the effectiveness of the
edge features extracted for segmentation.
Fig. 9 illustrates an example to compare the performance of different segmentation methods in terms of
tumor boundary accuracy. As mentioned above, the metric HD is more sensitive to the boundary shape, so
the corresponding HD scores of the whole tumor are provided as well. Among all the methods, the proposed
one obtains the best results in terms of both the HD scores and the visual effect. These results further show
Table 4: Objective evaluation results of different brain tumor segmentation methods on the BraTS2019 benchmark.
WT TC ET Average
Dice HD Dice HD Dice HD Dice HD
Attention Unet[27] 88.81 7.756 77.20 8.258 75.96 5.202 80.66 7.072
U-Net++[28] 89.67 6.345 87.13 5.521 80.25 3.313 85.68 5.060
Z. Jiang[49] 90.94 4.263 86.47 5.439 80.21 3.146 85.87 4.283
N3D[14] 91.60 6.547 88.80 6.219 83.00 3.543 87.80 5.436
HNF-Net[50] 91.11 4.136 86.40 5.250 80.96 3.490 86.16 4.292
T. Zhou[18] 89.70 6.700 77.50 9.300 70.60 7.400 79.27 7.800
TransBTS[10] 90.00 5.644 81.94 6.049 78.93 3.736 83.62 5.143
Proposed 91.58 3.866 89.24 5.118 83.84 3.080 88.22 4.021
Table 5: Objective evaluation results of different brain tumor segmentation methods on the BraTS2020 benchmark.
WT TC ET Average
Dice HD Dice HD Dice HD Dice HD
U-Net++[28] 89.77 6.299 85.57 5.483 79.83 4.328 85.06 5.370
Point-UNet[48] 89.67 - 82.97 - 76.43 - 83.02 8.260
TransBTS[10] 90.09 4.964 81.73 9.769 78.73 17.947 83.52 10.893
RFNet[51] 91.11 - 85.21 - 78.00 - 84.77 -
Proposed 91.03 4.719 88.22 5.985 84.61 3.051 87.95 4.585
that edge features can benefit the brain tumor segmentation task.
4.4. Ablation Study
To further verify the effectiveness of the main components including the shifted patch tokenization
strategy, the edge detection module and the MFIB in the feature fusion module that are designed in our
method, an ablation study is conducted in this subsection.
Table 6: Objective evaluation results for the ablation study on the BraTS2018 benchmark.
WT TC ET Average
Dice HD Dice HD Dice HD Dice HD
SwinTrans 88.97 6.276 85.72 6.563 80.31 4.364 85.00 5.734
SwinTrans+SPD 89.00 5.720 85.95 6.453 80.62 4.338 85.19 5.504
SwinTrans+SPD+ED 89.93 4.259 87.24 5.398 81.19 3.728 85.95 4.462
Proposed 90.89 3.923 87.96 5.217 81.94 3.440 86.93 4.193
In this experiment, we use the standard Swin Transformer [11] as the baseline model, and then add
different components one by one to validate their effect. Specifically, we mainly compare the performance
of the following four models:
Figure 10: Visual effect comparison of segmentation results obtained by different models in the ablation study (the columns
are the input image, SwinTrans, SwinTrans+SPD, SwinTrans+SPD+ED, the completed model and the ground truth).
-SwinTrans: Just using the Swin Transformer for brain tumor segmentation via a well pre-trained
model. This is the baseline model.
-SwinTrans+SPD: Introducing the shifted patch tokenization strategy into the Swin Transformer for
segmentation without using pre-training.
-SwinTrans+SPD+ED: Further adding the edge detection module based on the above model. This
model has a similar framework to the proposed one, but it simply concatenates the semantic and
edge features for fusion, instead of using the proposed MFIB.
-Completed Model: The complete model (i.e., SwinTrans+SPD+ED+MFIB) proposed in this paper.
Therefore, the comparison between SwinTrans and SwinTrans+SPD is used to demonstrate the effective-
ness of the shifted patch tokenization strategy adopted in the Swin Transformer. The comparison between
SwinTrans+SPD and SwinTrans+SPD+ED can validate the effect of the edge detection module (please
kindly note that the ED module cannot be individually used without the semantic segmentation module).
The comparison between SwinTrans+SPD+ED and the Completed Model is used to show the effectiveness of the
designed MFIB for feature fusion.
Table 6 lists the objective performance of different models. We can see that each of the above
components leads to some improvement of the segmentation results. Among them, the effect of adding the
edge detection module and using the MFIB for feature fusion is more obvious.
The visual effect comparison of segmentation results obtained by different models in the ablation study
is shown in Fig. 10. Some interesting observations include: 1) After adding the shifted patch tokenization, some
unnecessary disturbance noise is eliminated. 2) After adding the edge detection module, the segmentation
accuracy of tumor edges is obviously higher when compared with the GT. 3) After adding the MFIB for
feature fusion, the tumor edges are visually more natural than with the simple concatenation manner.
5. Conclusion
This paper proposes a novel deep learning-based brain tumor segmentation method by jointly utilizing
deep semantics and edge information in multimodal MRI. To achieve this target, three functional modules are
designed. Specifically, we present a semantic segmentation module based on an improved Swin Transformer
by introducing the shifted patch tokenization strategy for better training. In addition, a CNN-based edge
detection module is designed to extract edge features from the input MRI scans. Finally, we present a
feature fusion module by designing a multi-feature inference block based on graph convolution to fuse the
deep semantic features and specific edge features. Experimental results demonstrate the effectiveness of the
key components designed in our method. Moreover, the proposed method achieves better performance
when compared with some state-of-the-art methods on the BraTS benchmarks. Future work may focus
on exploring the feasibility of some other specific features for brain tumor segmentation and extending the
proposed approach to other semantic segmentation problems.
References
[1] F. Piccialli, V. Di Somma, F. Giampaolo, S. Cuomo, G. Fortino, A survey on deep learning in medicine: Why, how and
when?, Information Fusion 66 (2021) 111–137.
[2] S. Mo, M. Cai, L. Lin, R. Tong, Q. Chen, F. Wang, H. Hu, Y. Iwamoto, X.-H. Han, Y.-W. Chen, Multimodal priors
guided segmentation of liver lesions in mri using mutual information based graph co-attention networks, in: International
Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 429–438.
[3] A. Myronenko, 3d mri brain tumor segmentation using autoencoder regularization, in: International MICCAI Brainlesion
Workshop, Springer, 2018, pp. 311–320.
[4] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez,
D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (xai): Concepts, taxonomies, oppor-
tunities and challenges toward responsible ai, Information Fusion 58 (2020) 82–115.
[5] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE
conference on computer vision and pattern recognition, 2015, pp. 3431–3440.385
[6] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International
Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
[7] F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmen-
tation, in: 2016 fourth international conference on 3D vision (3DV), IEEE, 2016, pp. 565–571.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you
need, Advances in neural information processing systems 30.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint
arXiv:2010.11929.
[10] W. Wang, C. Chen, M. Ding, H. Yu, S. Zha, J. Li, Transbts: Multimodal brain tumor segmentation using transformer, in:
International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 109–119.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using
shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[12] S. H. Lee, S. Lee, B. C. Song, Vision transformer for small-size datasets, arXiv preprint arXiv:2112.13492.
[13] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, K. H. Maier-Hein, No new-net, in: International MICCAI Brainlesion
Workshop, Springer, 2018, pp. 234–244.
[14] F. Wang, R. Jiang, L. Zheng, C. Meng, B. Biswal, 3d u-net based brain tumor segmentation and survival days prediction,
in: International MICCAI Brainlesion Workshop, Springer, 2019, pp. 131–141.
[15] T. Cheng, X. Wang, L. Huang, W. Liu, Boundary-preserving mask r-cnn, in: European conference on computer vision,
Springer, 2020, pp. 660–676.
[16] D. Acuna, A. Kar, S. Fidler, Devil is in the edges: Learning semantic boundaries from noisy annotations, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11075–11083.
[17] D. Zhang, G. Huang, Q. Zhang, J. Han, J. Han, Y. Wang, Y. Yu, Exploring task structure for brain tumor segmentation
from multi-modality mr images, IEEE Transactions on Image Processing 29 (2020) 9032–9043.
[18] T. Zhou, S. Ruan, Y. Guo, S. Canu, A multi-modality fusion network based on attention mechanism for brain tumor
segmentation, in: 2020 IEEE 17th international symposium on biomedical imaging (ISBI), IEEE, 2020, pp. 377–380.
[19] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, H. Larochelle, Brain tumor
segmentation with deep neural networks, Medical image analysis 35 (2017) 18–31.
[20] B. H. Menze, K. v. Leemput, D. Lashkari, M.-A. Weber, N. Ayache, P. Golland, A generative model for brain tumor
segmentation in multi-modal images, in: International Conference on Medical Image Computing and Computer-Assisted
Intervention, Springer, 2010, pp. 151–159.
[21] M. P. Heinrich, O. Maier, H. Handels, Multi-modal multi-atlas segmentation using discrete optimisation and self-
similarities., VISCERAL Challenge@ ISBI 1390 (2015) 27.
[22] M. Goetz, C. Weber, J. Bloecher, B. Stieltjes, H.-P. Meinzer, K. Maier-Hein, Extremely randomized trees based brain
tumor segmentation, Proceeding of BRATS challenge-MICCAI (2014) 006–011.
[23] N. K. Subbanna, D. Precup, D. L. Collins, T. Arbel, Hierarchical probabilistic gabor and mrf segmentation of brain
tumours in mri volumes, in: International conference on medical image computing and computer-assisted intervention,
Springer, 2013, pp. 751–758.
[24] D. Zikic, B. Glocker, E. Konukoglu, A. Criminisi, C. Demiralp, J. Shotton, O. M. Thomas, T. Das, R. Jena, S. J. Price,
Decision forests for tissue-specific segmentation of high-grade gliomas in multi-channel mr, in: International Conference
on Medical Image Computing and Computer-Assisted Intervention, Springer, 2012, pp. 369–376.
[25] W. Wu, A. Y. Chen, L. Zhao, J. J. Corso, Brain tumor detection and segmentation in a crf (conditional random fields)
framework with pixel-pairwise affinity and superpixel-level features, International journal of computer assisted radiology
and surgery 9 (2) (2014) 241–253.
[26] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, B. Glocker, Efficient
multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, Medical image analysis 36 (2017) 61–78.
[27] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz,
et al., Attention u-net: Learning where to look for the pancreas, arXiv preprint arXiv:1804.03999.
[28] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, Unet++: Redesigning skip connections to exploit multiscale features
in image segmentation, IEEE transactions on medical imaging 39 (6) (2019) 1856–1867.
[29] Y. Liu, F. Mu, Y. Shi, X. Chen, Sf-net: A multi-task model for brain tumor segmentation in multimodal mri via image
fusion, IEEE Signal Processing Letters 29 (2022) 1799–1803.
[30] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-
based biomedical image segmentation, Nature methods 18 (2) (2021) 203–211.
[31] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, Y. Zhou, Transunet: Transformers make strong
encoders for medical image segmentation, arXiv preprint arXiv:2102.04306.
[32] J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, V. M. Patel, Medical transformer: Gated axial-attention for medical image
segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2021, pp. 36–46.
[33] S. Pereira, A. Pinto, V. Alves, C. A. Silva, Brain tumor segmentation using convolutional neural networks in mri images,
IEEE transactions on medical imaging 35 (5) (2016) 1240–1251.
[34] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolu-
tional neural networks, in: International MICCAI brainlesion workshop, Springer, 2017, pp. 178–190.
[35] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, I. B. Ayed, Hyperdense-net: a hyper-densely connected cnn
for multi-modal image segmentation, IEEE transactions on medical imaging 38 (5) (2018) 1116–1126.
[36] Y. Liu, F. Mu, Y. Shi, J. Cheng, C. Li, X. Chen, Brain tumor segmentation in multimodal mri via pixel-level and
feature-level image fusion, Frontiers in Neuroscience 16 (2022) 1000587.
[37] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, Z. He, Modality-aware mutual learning for multi-modal medical
image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention,
Springer, 2021, pp. 589–599.
[38] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, X. Bai, Richer convolutional features for edge detection, in: Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp. 3000–3009.
[39] X. Chen, C. Dong, J. Ji, J. Cao, X. Li, Image manipulation detection by multi-view multi-scale supervision, in: Proceedings
of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14185–14193.
[40] L. Jiao, J. Chen, F. Liu, S. Yang, C. You, X. Liu, L. Li, B. Hou, Graph representation learning meets computer vision: A
survey, IEEE Transactions on Artificial Intelligence.
[41] Y. Chen, M. Rohrbach, Z. Yan, Y. Shuicheng, J. Feng, Y. Kalantidis, Graph-based global reasoning networks, in: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 433–442.
[42] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.
[43] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest,
et al., The multimodal brain tumor image segmentation benchmark (brats), IEEE transactions on medical imaging 34 (10)
(2014) 1993–2024.
[44] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, C. Davatzikos, Advancing
the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features, Scientific data 4 (1)
(2017) 1–13.
[45] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki,
et al., Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall
survival prediction in the brats challenge, arXiv preprint arXiv:1811.02629.
[46] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[47] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, J. Liu, CE-Net: Context encoder network for 2D
medical image segmentation, IEEE transactions on medical imaging 38 (10) (2019) 2281–2292.
[48] N.-V. Ho, T. Nguyen, G.-H. Diep, N. Le, B.-S. Hua, Point-Unet: A context-aware point-based neural network for volumetric
segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2021, pp. 644–655.
[49] Z. Jiang, C. Ding, M. Liu, D. Tao, Two-stage cascaded U-Net: 1st place solution to BraTS challenge 2019 segmentation
task, in: International MICCAI brainlesion workshop, Springer, 2019, pp. 231–241.
[50] H. Jia, Y. Xia, W. Cai, H. Huang, Learning high-resolution and efficient non-local features for brain glioma segmentation
in MR images, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer,
2020, pp. 480–490.
[51] Y. Ding, X. Yu, Y. Yang, RFNet: Region-aware fusion network for incomplete multi-modal brain tumor segmentation, in:
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3975–3984.